Remove COLUMN_DTYPES and switch to field metadata dictionary #1408
Conversation
Rather than storing two separate sources of column/field type information, derive all of the pandas data types from the now complete(-ish?) Pydantic metadata. Mostly this is implemented by a small helper function `pudl.metadata.fields.get_pandas_dtypes()`, which refers to the canonical PUDL field metadata (and data group overrides) by default, but which can also use independent field type dictionaries, so it can be used to enforce types on analysis tables that mix canonical PUDL fields with other columns pertaining to that analytical context.

One (temporary) change that was required to make this work is that all of the `_year` fields are now being typed as integers rather than years, to avoid having them converted into nanoseconds after Jan 1st, 1970. We'll want to undo this when we handle year, month, and day reporting resolutions in the harvesting process and merge dataframes of different frequencies together, but for now it maintains the existing behavior, doesn't break anything, and was a small change.

Closes #1270
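A minimal sketch of the intended usage, assuming `get_pandas_dtypes()` returns a `{field_name: pandas_dtype}` mapping for the canonical PUDL fields (the exact signature may differ, and the column names here are purely illustrative):

```python
import pandas as pd
from pudl.metadata.fields import get_pandas_dtypes

# An ad-hoc analysis dataframe mixing canonical PUDL fields with
# analysis-specific columns:
df = pd.DataFrame({
    "plant_id_eia": [1, 2],          # canonical PUDL field
    "my_custom_metric": [0.3, 0.7],  # hypothetical analysis-specific column
})
# Apply the canonical dtypes only to the columns PUDL knows about,
# leaving analysis-specific columns untouched:
dtypes = get_pandas_dtypes()
df = df.astype({col: dtypes[col] for col in df.columns if col in dtypes})
```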
Codecov Report
```diff
@@            Coverage Diff             @@
##              dev    #1408      +/-   ##
==========================================
+ Coverage   83.49%   83.53%   +0.04%
==========================================
  Files          64       64
  Lines        6910     6921      +11
==========================================
+ Hits         5769     5781      +12
+ Misses       1141     1140       -1
```
Continue to review full report at Codecov.
If we want to use `Resource.format_df()` to correct dtypes for arbitrary dataframes whose fields are specified in the metadata, we could add a classmethod along these lines:

```python
class Resource(Base):
    ...
    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> "Resource":
        """Create a Resource for a dataframe whose fields exist in the metadata system."""
        fields = []
        for field in df.columns:
            # This will fail if `field` is not specified in FIELD_METADATA
            fields.append(Field.dict_from_id(field))
        schema = {"fields": fields}
        # Add any additional required params for Resource schemas
        return cls(schema=schema)
```

Use:

```python
# An arbitrary dataframe with fields that exist in FIELD_METADATA
gen_df = pd.DataFrame({"fuel_type": ["gas", "coal"], "generator_id": ["1A", "2B"]})
# Create the schema
gen_schema = Resource.from_dataframe(gen_df)
# Validate and format the dataframe
gen_df = gen_schema.format_df(gen_df)
```
Left a few comments and the proposal above.
Tracked down all the instances in which we were using `pudl.helpers.get_pudl_dtypes()` and replaced them with the new function call. This is an incremental step toward standardizing the way we enforce data types on dataframes, even when they aren't fully specified by our metadata / database schema information. This affected the way `pudl.helpers.clean_merge_asof()` is called as well. Removed `pudl.helpers.get_pudl_dtype()` (singular), since it was duplicative given the `pudl.metadata.fields.get_pandas_dtypes()` function.
It turns out that in most if not all of the cases where we are applying the canonical PUDL dtypes, what we're actually doing is forcing all non-float columns to use nullable pandas dtypes... and the built-in `df.convert_dtypes()` method does exactly this. I've hunted down and replaced all usage of `get_pudl_dtypes()` with `convert_dtypes()`, which should avoid the need for our own typing infrastructure in almost all cases. Excluding the conversion of floating point values is necessary since the pandas nullable floats don't work with all of the NumPy functions. There may be other instances (e.g. boolean columns with NA values that we use as masks) where this doesn't work either... but I haven't come across any just yet.
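A minimal sketch of that pattern; the column names are just illustrative, and this relies on the `convert_floating` flag added to pandas in 1.2:

```python
import pandas as pd

df = pd.DataFrame({
    "plant_id_eia": [1, 2, None],         # read in as float64 with NaN
    "capacity_mw": [100.5, 200.1, None],  # genuinely fractional floats
    "fuel_type": ["gas", "coal", None],   # object
})
# Convert everything to nullable pandas dtypes (Int64, string, boolean),
# but leave floats as numpy float64, since the nullable Float64 dtype
# doesn't work with all NumPy functions.
df = df.convert_dtypes(convert_floating=False)
# plant_id_eia -> Int64, capacity_mw -> float64, fuel_type -> string
```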
Consolidated the PUDL data type application functions into `pudl.metadata.fields`:

* `get_pudl_dtypes()` gets an exhaustive mapping of all defined fields to their types.
* `apply_pudl_dtypes(df)` applies the relevant dtypes to the columns in a dataframe.

Swapped in `apply_pudl_dtypes()` in the places where I had just switched to using `df.convert_dtypes()`, since that had a fair number of bad side effects, particularly with respect to datetime columns. A sketch of the intended usage follows.
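A hedged sketch of how these two helpers fit together; the exact signatures (e.g. any data-group argument) may differ from what's shown:

```python
import pandas as pd
from pudl.metadata.fields import apply_pudl_dtypes, get_pudl_dtypes

df = pd.DataFrame({
    "report_date": ["2020-01-01", "2020-02-01"],  # canonical PUDL field
    "my_analysis_col": [1.5, 2.5],                # hypothetical, not in the metadata
})
# apply_pudl_dtypes() retypes only the columns that appear in the PUDL
# field metadata, passing everything else through unchanged.
df = apply_pudl_dtypes(df)

# The underlying {field_name: dtype} mapping can also be inspected directly:
print(get_pudl_dtypes()["report_date"])  # e.g. datetime64[ns]
```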
In almost all cases, simply applying PUDL data types to dataframes is enough to make them behave well in merges etc. The exceptions are currently:

* `pudl.transform.eia861_compare_totals()` (which has issues w/ bools & NA values), and
* The end of the main eia, eia861, and ferc1 transform steps, where we really are cleaning up bad values in addition to applying canonical data types. In these cases it's being called indirectly by `convert_dfs_dict_dtypes()`.

There are also some calls to `convert_cols_dtypes()` in test/unit/analysis/allocate_net_gen.py, but I'm not sure if those can/should be switched.
Removed some additional unnecessary calls to `convert_cols_dtypes()`, and replaced calls to `convert_dfs_dict_dtypes()` with dictionary comprehensions using `convert_cols_dtypes()` (see the sketch after this comment). Replaced the use of `convert_cols_dtypes()` that was being applied to the entity tables post harvesting with `apply_pudl_dtypes()`, since all of the input values should have been cleaned up prior to harvesting. The remaining uses of `convert_cols_dtypes()` are:

* At the end of the EIA transformations, in the ETL module.
* At the end of the EIA-861 transformations (analogous to the above).
* At the end of the FERC-1 transformations.
* One use in the midst of the EIA-861 transformations, to deal with bools/NAs.

Did a little bit of debugging of #1371 but ultimately left it for @cmgosnell. Did rename the allocate_net_gen test module so pytest will pick it up, but marked the tests xfail.

Ran `tox -e nuke` overnight to see if any of the previous changes had affected the data validations. The only changes were a few additional rows showing up in the generation_fuel_eia923 and generation_fuel_nuclear_eia923 tables. I updated the expected row counts appropriately... but I'm not 100% sure what the changes there would be. Should compare with a previous version of the DB.
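A hedged sketch of the dictionary-comprehension replacement, assuming `convert_cols_dtypes()` accepts the dataframe plus a `data_source` argument (check the actual signature in `pudl.helpers`):

```python
import pandas as pd
import pudl

dfs = {"generators": pd.DataFrame({"plant_id_eia": [1, 2]})}
# Instead of convert_dfs_dict_dtypes(dfs, "eia"), convert each dataframe
# in the dict individually:
dfs = {
    name: pudl.helpers.convert_cols_dtypes(df, data_source="eia")
    for name, df in dfs.items()
}
```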
Also looked into why there are a few more records showing up in the generation_fuel and generation_fuel_nuclear tables now... It's not clear, but they look like valid records, and they are all from 2001. It seems that this change happened after we released PUDL v0.5.0 and before I started messing around with the data types though -- the "additional" records were present in a copy of the PUDL DB that I had from 2021-11-28, but were not there on 2021-11-11.
I think we'll also want to make it easier to inject types for ad-hoc dataframes which don't have Resource definitions, but @cmgosnell said that in the depreciation cases where she was doing this, just using ...
Fixed a variety of small bugs in the EIA-861 metadata.
Removed the duplicative `pudl.constants.COLUMN_DTYPES` dictionary, and am now referring exclusively to the data types which come from `pudl.metadata.fields`.
At some level this is circumventing the Field/Resource/Package API by referring directly to the field metadata dictionaries, which isn't great. The functionality that we're going for here is being able to impose data types on a random dataframe which contains some PUDL columns, even if it's not one of the canonical database tables, or contains a mix of fields from several different tables / data sources.
However, the more legitimate alternative -- defining full Resource definitions for these other tables -- feels like a lot of overhead, given how frequently we're apparently re-applying types in analyses and dataframe manipulations.
Is there a more robust way to implement this that's still low overhead for ad-hoc use?
See the individual commit messages for additional details on what was changed.