Prototype dagster-pandera integration #3282
Conversation
some preliminary comments
cc @zaneselvans, who I think has done a lot of the recent touching of this, and @cmgosnell + @e-belfer, who have recently expressed some desire for using our hard-won schema information to annotate our asset types - not tagging for review since it's not ready yet, but just an FYI.
Something to note is dagster's addition of asset checks which allow you to run arbitrary tests after an asset executes. The feature is still experimental but seems pretty useful. I asked how asset checks and something like dagster-pandera compare. If we want to use pandera + dagster asset checks they recommend validating the asset using pandera inside of the asset check instead of using dagster-pandera.
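For concreteness, here is a minimal sketch of that recommendation - running pandera inside an @asset_check rather than via dagster-pandera. The asset, column, and schema names are hypothetical, not taken from this PR:

```python
import pandas as pd
import pandera as pa
from dagster import AssetCheckResult, asset_check

# Hypothetical pandera schema for an equally hypothetical asset.
fuel_schema = pa.DataFrameSchema(
    {"fuel_mmbtu_per_unit": pa.Column(float, pa.Check.ge(0), nullable=True)}
)


@asset_check(asset="core_fuel_table")  # hypothetical asset name
def fuel_schema_check(core_fuel_table: pd.DataFrame) -> AssetCheckResult:
    """Validate the asset with pandera and report the outcome to dagster."""
    try:
        fuel_schema.validate(core_fuel_table, lazy=True)
        return AssetCheckResult(passed=True)
    except pa.errors.SchemaErrors as err:
        return AssetCheckResult(
            passed=False, metadata={"failure_cases": str(err.failure_cases)}
        )
```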
Ooh, thanks @bendnorman! Using something that's more "core" dagster is better than using something that's less core :)
I'm not sure if it matters but it seems like there's a conflation of 2 related functions here: asset type checking, and more complex data validations that look at the data contents, beyond just the schema (though I guess you could consider constraints on the values and relationships between columns part of a more complicated "type", as with the validations that Pydantic does). But if we use an asset check and happen to use Pandera inside it, will the dataframe typing information still be available to Dagster? Or to us in the IDE? It sounded like there was a vague plan to update …
Force-pushed from 6230349 to 7157ca3
@@ -1001,6 +1058,69 @@ class ResourceHarvest(PudlMeta):
    """Fraction of invalid fields above which result is considered invalid."""


class PudlResourceDescriptor(PudlMeta):
Added this class so that we have a proper description/enforcement of the shape of RESOURCE_METADATAs. This should make it easier for someone to understand how to add a new resource.
Ran into a couple things that are a little funky, documented as TODOs.
@@ -1185,8 +1305,18 @@ def _check_harvest_primary_key(cls, value, info: ValidationInfo):
        return value

    @staticmethod
-   def dict_from_id(x: str) -> dict:  # noqa: C901
-       """Construct dictionary from PUDL identifier (`resource.name`).
+   def dict_from_id(resource_id: str) -> dict:
It was surprisingly simple to run all of our RESOURCE_METADATA through the new type and validation machinery!
So this uses the new high-level composite PudlResourceDescriptor to validate whatever we've encoded in our giant dictionary of doom, rather than waiting for each of the individual subcomponents to get instantiated and validated? The main goal being to ensure that we have a complete, explicit (rather than implicit) description of what needs to be defined for PUDL at the resource / table level?
Yes! Basically:
- It's helpful to have something like PudlResourceDescriptor to explicitly encode our bespoke resource descriptor data structure!
- It's... much less helpful if we don't actually enforce that that's what we're working with in the rest of the code.
- This should also blow up on incorrectly defined resources at validation time, instead of when we're then trying to do all this transformation logic to them.
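To make the "blow up at validation time" point concrete, here is a toy sketch of that pattern - this is not the actual PudlResourceDescriptor, and every field name below is made up:

```python
from pydantic import BaseModel, ValidationError


class ResourceSchemaSketch(BaseModel):
    """Toy stand-in for a resource's schema block."""

    columns: list[str]
    primary_key: list[str] = []


class ResourceDescriptorSketch(BaseModel):
    """Toy stand-in for one RESOURCE_METADATA entry."""

    description: str
    resource_schema: ResourceSchemaSketch


try:
    # A malformed entry fails here, at validation time...
    ResourceDescriptorSketch(description="...", resource_schema={"columns": "oops"})
except ValidationError as err:
    print(err)  # ...instead of deep inside some transform step later on.
```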
src/pudl/validate.py (Outdated)
@@ -1533,6 +1554,7 @@ def plot_vs_agg(orig_df, agg_df, validation_cases):
        "low_bound": 0.95,
        "data_col": "fuel_mmbtu_per_unit",
        "weight_col": "fuel_consumed_units",
+       "xfail": True,
It might be worthwhile to define a BoundsCheckParams class that all these bounds checks conform to, instead of having them be dictionaries that can be anything. It's not necessary for this code to work, though. What do people think?
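A rough sketch of what such a container could look like - the field names mirror the dictionary keys visible in the diff above, plus a few guessed from pudl.validate's bounds-check convention:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundsCheckParams:
    """Parameters for one quantile/bounds check (sketch, not actual PUDL code)."""

    title: str
    query: str
    data_col: str
    weight_col: str
    low_q: float | None = None
    low_bound: float | None = None
    hi_q: float | None = None
    hi_bound: float | None = None
    xfail: bool = False


# A misspelled or missing key now fails at construction instead of deep in a test run.
check = BoundsCheckParams(
    title="fuel heat content (hypothetical example)",
    query="fuel_type_code_pudl == 'gas'",
    data_col="fuel_mmbtu_per_unit",
    weight_col="fuel_consumed_units",
    low_q=0.05,
    low_bound=0.95,
    xfail=True,
)
```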
This module is a horror show written long ago, before we knew about Pydantic or data classes or typed dicts. It's long overdue for an overhaul. I don't even know if it should continue existing after we rip out PudlTabl and change how the tests and validations access the database.
If we are going to be running the validations during the ETL as soon as the relevant table has been created, where does it make the most sense to define the validations that apply to that table? Should it be adjacent to the asset definitions (where all of the cleaning / munging is actually happening)? And then referenced from the Resource definition? Or should it actually be in the resource definition? Or would we want to continue to have a separate module or set of modules dedicated to defining the validations like we do now?
At the same time we probably want to develop a library of generic kinds-of-checks that we'll end up using in many different contexts, instantiating them with appropriate parameters when they're defined for the individual tables that they apply to. So maybe we turn pudl.validations into the place where those validation classes / types are defined?
I think what makes the most sense is:
- Use the PUDL resource schemas to define the data type validations. If e.g. "everything in this column should be >10", that seems like it lives in the schema.
- More complicated validations should be defined alongside the assets - you can just define an @asset_check right in the module, in which case we'll have to add some load_asset_checks_from_module calls to etl/__init__.py (see the sketch below).
- We should have utility functions + types related to validations in the pudl.validations module.
I don't think we need to then reference the more complicated validations alongside the schema in RESOURCE_METADATA.
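A rough sketch of the second bullet, with hypothetical asset and column names:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check


@asset
def core_generators() -> pd.DataFrame:
    """Hypothetical asset standing in for a real PUDL table."""
    return pd.DataFrame({"capacity_mw": [100.0, 250.0]})


@asset_check(asset=core_generators)
def capacity_is_positive(core_generators: pd.DataFrame) -> AssetCheckResult:
    """Fail the check if any generator reports non-positive capacity."""
    bad_rows = int((core_generators["capacity_mw"] <= 0).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"bad_rows": bad_rows})
```

Collecting checks defined this way would then be a matter of adding a load_asset_checks_from_modules([...]) call (dagster's module-loading helper; note the plural spelling) wherever defs is assembled in etl/__init__.py.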
Force-pushed from 7157ca3 to 2e2c27f
Wow this is great! Huge progress toward a better validation setup, and yet not huge changes. I had some minor clarifying questions and requests for docstrings, but I think we should get this in as incremental progress and think about whether we want to move on to anything mentioned below:
Since we have so much new data coming into the DB right now, I think it would be helpful to create some more aspirational templates / examples of how we actually want the data validations to look going forward, so that as we add new cases, we can be doing it the Right Way. The factory wrapper for converting the old validation feels a little opaque for this purpose.
So it would be nice to have examples of both new df_checks and field_checks which are not tied to the old data validation implementation (though the underlying content could be drawn from those validations if we want).
Probably this would mean implementing at least one parametrized check class and migrating at least some of the existing data validations into that container completely, not just by reference to pudl.validate.
Running the validations in the ETL will require more compute, but I think there's likely to be a big overall performance speedup for the validations, given that the way the validations are currently organized, the same dataframes are being read out of the DB and into memory repeatedly (since the tests are grouped by type of validation, not by which dataframe they apply to). This also results in memory usage blowing up if you try and run the data validations on a machine with lots of cores with pytest-xdist.
And then of course we'll also discover violations of our validation checks immediately in the ETL instead of in the nightly builds, which means we'll fix them sooner.
IIRC by default failing asset checks don't cause the ETL to fail, do they? How will we be notified when a check fails? Do we want to set them to ERROR out on check failure? I think I saw that was a new option as of the most recent dagster release.
Are there any checks that we can't encode this way? I guess any check (like referential integrity) that involves more than one table won't work.
    asset_key: AssetKey,
    package: pudl.metadata.classes.Package,
) -> AssetChecksDefinition | None:
    """Create a dagster asset check based on the resource schema, if defined."""
Do we expect there to be cases in which a Resource does not have a Schema defined?
No, I don't think so. But I do expect there to be assets that are built without corresponding Resources - e.g. raw_*.
Ah, okay. I think the docstring could be clearer then. "if defined" is not "if there's a schema defined for this resource" but "if there's a resource defined for this asset."
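Something along these lines, perhaps - the function name and signature are taken from the diff above, and only the docstring wording is a suggestion:

```python
from dagster import AssetChecksDefinition, AssetKey

import pudl.metadata.classes


def asset_check_from_schema(
    asset_key: AssetKey,
    package: pudl.metadata.classes.Package,
) -> AssetChecksDefinition | None:
    """Create a dagster asset check for this asset, if a Resource is defined for it."""
    ...
```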
warnings.filterwarnings("ignore", category=ExperimentalWarning)
_package = pudl.metadata.classes.Package.from_resource_ids()
_asset_keys = itertools.chain.from_iterable(
    _get_keys_from_assets(asset_def) for asset_def in default_assets
)
default_asset_checks = [
    check
    for check in (
        asset_check_from_schema(asset_key, _package) for asset_key in _asset_keys
    )
    if check is not None
]
How do the default asset checks relate to the default assets / how are they associated with each other? Are they just identified with each other based on the ordering of these lists that are passed into the definition of defs? Or is there some key within the assets and the checks that matches them up?
They're associated by explicitly linking them in the asset param you pass to @asset_check.
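In other words, something like the following (a hypothetical reconstruction of how defs gets assembled - the ordering of the two lists doesn't matter, only the asset referenced by each check):

```python
from dagster import Definitions

# default_assets and default_asset_checks as built in the snippet above; dagster
# pairs each check with its asset via the check's `asset` argument, not by position.
defs = Definitions(
    assets=default_assets,
    asset_checks=default_asset_checks,
)
```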
src/pudl/metadata/classes.py (Outdated)
    df_checks: list[Callable] = []
    field_checks: dict[SnakeCase, list[Callable]] = {}
Maybe this shows up below, but where do you envision these additional (beyond the schema) table/column level checks being defined?
Sweet! Yeah, I was a bit bummed that I let the scope creep up (from Do One Asset to Do A Bunch of Assets + Make Some Ergonomic Improvements), but I think the extra work was very high bang for buck.
Agreed - I think this would be a great follow-up! We should define those directly alongside the assets, I think, as mentioned above. Since …
Yeah, sounds right.
Yes! Agreed on these benefits :)
I think they don't cause the ETL to stop running but they do cause the whole run to be marked as a "failure". I can make a bogus failing test and run …
Asset checks have additional_ins which should let us encode checks across multiple tables. I think the upshot of all this is that we need the following before we merge:
And we want to encode some validations in-line instead of in-schema as a follow-up PR. For that one, we could even rip out the "df_checks" and "field_checks" machinery in this PR, and say:
If we agree on ^, should we then rip out the df_checks and field_checks machinery in this PR pre-emptively?
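On the additional_ins point above, a hedged sketch of what a cross-table (referential-integrity-style) check might look like - the table and column names are hypothetical, and the exact parameter shape is worth double-checking against current dagster docs:

```python
import pandas as pd
from dagster import AssetCheckResult, AssetIn, asset_check


@asset_check(
    asset="core_generators",  # hypothetical target asset
    additional_ins={"core_plants": AssetIn("core_plants")},  # second table to load
)
def plant_ids_exist(
    core_generators: pd.DataFrame, core_plants: pd.DataFrame
) -> AssetCheckResult:
    """Every plant_id in the generators table should appear in the plants table."""
    orphans = set(core_generators["plant_id"]) - set(core_plants["plant_id"])
    return AssetCheckResult(
        passed=not orphans, metadata={"orphan_plant_ids": sorted(orphans)}
    )
```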
Force-pushed from 391b637 to afa314b
Made a "minimal checks" issue here: #3412 - @zaneselvans this should be ready for re-review! |
…ecks. Lots of ergo improvements to be had.
Still remaining: generate asset checks programmatically, instead of with laborious manual typing.
Force-pushed from afa314b to a4c4ab6
Overview
Relevant to #1572, but doesn't close it. Opens the possibility of closing it.
We need to integrate:
Fortunately, there's already integration between pandera and dagster. We just need a shim layer between our existing schema definition system and pandera - hence, this PR.
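As a rough illustration of what that shim amounts to (not the PR's actual code - the type mapping and helper name here are hypothetical):

```python
import pandera as pa

# Hypothetical mapping from simplified PUDL field types to pandas dtype strings.
FIELD_DTYPES = {"integer": "Int64", "number": "float64", "string": "string"}


def pandera_schema_from_fields(fields: dict[str, str]) -> pa.DataFrameSchema:
    """Build a pandera schema from simplified {column_name: field_type} metadata."""
    return pa.DataFrameSchema(
        {
            name: pa.Column(FIELD_DTYPES[field_type], nullable=True)
            for name, field_type in fields.items()
        }
    )


# Stripped-down stand-in for what a real Resource schema would provide:
schema = pandera_schema_from_fields(
    {"plant_id_eia": "integer", "plant_name_eia": "string"}
)
```

The real version would pull dtypes and constraints from the existing Field/Resource metadata classes rather than a hand-written dict.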
Changes:
See details for the original rant, but I ended up introducing some classes to define our existing RESOURCE_METADATA data structure. That lays the groundwork for moving towards using the frictionless library in the future, as well as further refactoring to make our metadata handling interoperate with various different libraries like we discussed in https://github.com/orgs/catalyst-cooperative/discussions/2546.

Our existing schema definition system is hard to change for many reasons; one of which is that the shape of the data in RESOURCE_METADATA is actually not documented anywhere! Instead we have all sorts of weird logic scattered throughout to turn that undocumented shape into something resembling Frictionless Packages/Resources/Fields (but that are distinct classes...)

One thing we could do to make this whole thing a bit more tractable is to actually define some sort of ResourceMetadata or ResourceSpec class (and, potentially, companion SchemaSpec, FieldSpec classes) that explicitly describes what's actually in RESOURCE_METADATA.

Right now, our Resource (and companion) classes happen to be "a copy of the frictionless class, sort of, that knows how to translate from the undefined shape of what's in RESOURCE_METADATA into our ersatz frictionless class." Which causes much entanglement/confusion.

It doesn't seem like it would be too big of a refactor to split up Resource into ResourceSpec and the official frictionless.Resource, so maybe we should do that sometime soon.

Testing
How did you make sure this worked? How can a reviewer verify this?
I ran the whole ETL, and saw that all the asset checks passed. I also happened to pick an asset which has some validation checks we expect to fail - so I saw that failing checks did actually show up in Dagster, before then adding machinery to xfail certain checks.
To-do list