Prototype dagster-pandera integration #3282
Conversation
some preliminary comments
cc @zaneselvans, who I think has done a lot of the recent touching of this, and @cmgosnell + @e-belfer, who have recently expressed some desire for using our hard-won schema information to annotate our asset types - not tagging for review since it's not ready yet, but just an FYI.
Something to note is dagster's addition of asset checks which allow you to run arbitrary tests after an asset executes. The feature is still experimental but seems pretty useful. I asked how asset checks and something like dagster-pandera compare. If we want to use pandera + dagster asset checks they recommend validating the asset using pandera inside of the asset check instead of using dagster-pandera.
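For concreteness, here is a minimal sketch of that recommendation - running pandera inside an @asset_check rather than via dagster-pandera. The asset, column, and schema names are hypothetical, not taken from this PR:

```python
import pandas as pd
import pandera as pa
from dagster import AssetCheckResult, asset_check

# Hypothetical pandera schema for an equally hypothetical asset.
fuel_schema = pa.DataFrameSchema(
    {"fuel_mmbtu_per_unit": pa.Column(float, pa.Check.ge(0), nullable=True)}
)


@asset_check(asset="core_fuel_table")  # hypothetical asset name
def fuel_schema_check(core_fuel_table: pd.DataFrame) -> AssetCheckResult:
    """Validate the asset with pandera and report the outcome to dagster."""
    try:
        fuel_schema.validate(core_fuel_table, lazy=True)
        return AssetCheckResult(passed=True)
    except pa.errors.SchemaErrors as err:
        return AssetCheckResult(
            passed=False, metadata={"failure_cases": str(err.failure_cases)}
        )
```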
Ooh, thanks @bendnorman! Using something that's more "core" dagster is better than using something that's less core :)
I'm not sure if it matters but it seems like there's a conflation of 2 related functions here: asset type checking, and more complex data validations that look at the data contents, beyond just the schema (though I guess you could consider constraints on the values and relationships between columns part of a more complicated "type", as with the validations that Pydantic does). But if we use an asset check and happen to use Pandera inside it, will the dataframe typing information still be available to Dagster? Or to us in the IDE? It sounded like there was a vague plan to update …
Force-pushed from 6230349 to 7157ca3
@@ -1001,6 +1058,69 @@ class ResourceHarvest(PudlMeta):
    """Fraction of invalid fields above which result is considered invalid."""


class PudlResourceDescriptor(PudlMeta):
Added this class so that we have a proper description/enforcement of the shape of RESOURCE_METADATAs. This should make it easier for someone to understand how to add a new resource.
Ran into a couple things that are a little funky, documented as TODOs.
@@ -1185,8 +1305,18 @@ def _check_harvest_primary_key(cls, value, info: ValidationInfo):
        return value

    @staticmethod
-   def dict_from_id(x: str) -> dict:  # noqa: C901
-       """Construct dictionary from PUDL identifier (`resource.name`).
+   def dict_from_id(resource_id: str) -> dict:
It was surprisingly simple to run all of our RESOURCE_METADATA through the new type and validation machinery!
So this uses the new high-level composite PudlResourceDescriptor to validate whatever we've encoded in our giant dictionary of doom, rather than waiting for each of the individual subcomponents to get instantiated and validated? The main goal being to ensure that we have a complete, explicit (rather than implicit) description of what needs to be defined for PUDL at the resource / table level?
Yes! Basically:
- It's helpful to have something like PudlResourceDescriptor to explicitly encode our bespoke resource descriptor data structure!
- It's... much less helpful if we don't actually enforce that that's what we're working with in the rest of the code.
- This should also blow up on incorrectly defined resources at validation time, instead of when we're then trying to do all this transformation logic to them.
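To make the "blow up at validation time" point concrete, here is a toy sketch of that pattern - this is not the actual PudlResourceDescriptor, and every field name below is made up:

```python
from pydantic import BaseModel, ValidationError


class ResourceSchemaSketch(BaseModel):
    """Toy stand-in for a resource's schema block."""

    columns: list[str]
    primary_key: list[str] = []


class ResourceDescriptorSketch(BaseModel):
    """Toy stand-in for one RESOURCE_METADATA entry."""

    description: str
    resource_schema: ResourceSchemaSketch


try:
    # A malformed entry fails here, at validation time...
    ResourceDescriptorSketch(description="...", resource_schema={"columns": "oops"})
except ValidationError as err:
    print(err)  # ...instead of deep inside some transform step later on.
```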
src/pudl/validate.py (Outdated)
@@ -1533,6 +1554,7 @@ def plot_vs_agg(orig_df, agg_df, validation_cases):
        "low_bound": 0.95,
        "data_col": "fuel_mmbtu_per_unit",
        "weight_col": "fuel_consumed_units",
+       "xfail": True,
It might be worthwhile to define a BoundsCheckParams class that all these bounds checks conform to, instead of having them be dictionaries that can be anything. It's not necessary for this code to work, though. What do people think?
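A rough sketch of what such a container could look like - the field names mirror the dictionary keys visible in the diff above, plus a few guessed from pudl.validate's bounds-check convention:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundsCheckParams:
    """Parameters for one quantile/bounds check (sketch, not actual PUDL code)."""

    title: str
    query: str
    data_col: str
    weight_col: str
    low_q: float | None = None
    low_bound: float | None = None
    hi_q: float | None = None
    hi_bound: float | None = None
    xfail: bool = False


# A misspelled or missing key now fails at construction instead of deep in a test run.
check = BoundsCheckParams(
    title="fuel heat content (hypothetical example)",
    query="fuel_type_code_pudl == 'gas'",
    data_col="fuel_mmbtu_per_unit",
    weight_col="fuel_consumed_units",
    low_q=0.05,
    low_bound=0.95,
    xfail=True,
)
```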
This module is a horror show written long ago, before we knew about Pydantic or data classes or typed dicts. It's long overdue for an overhaul. I don't even know if it should continue existing after we rip out PudlTabl and change how the tests and validations access the database.
If we are going to be running the validations during the ETL as soon as the relevant table has been created, where does it make the most sense to define the validations that apply to that table? Should it be adjacent to the asset definitions (where all of the cleaning / munging is actually happening)? And then referenced from the Resource definition? Or should it actually be in the resource definition? Or would we want to continue to have a separate module or set of modules dedicated to defining the validations like we do now?
At the same time we probably want to develop a library of generic kinds-of-checks that we'll end up using in many different contexts, instantiating them with appropriate parameters when they're defined for the individual tables that they apply to. So maybe we turn pudl.validations into the place where those validation classes / types are defined?
I think what makes the most sense is:
- Use the PUDL resource schemas to define the data type validations. If e.g. "everything in this column should be >10", that seems like it lives in the schema.
- More complicated validations should be defined alongside the assets - you can just define an @asset_check right in the module, in which case we'll have to add some load_asset_checks_from_module calls to etl/__init__.py (see the sketch below).
- We should have utility functions + types related to validations in the pudl.validations module.
I don't think we need to then reference the more complicated validations alongside the schema in RESOURCE_METADATA.
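A rough sketch of the second bullet, with hypothetical asset and column names:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check


@asset
def core_generators() -> pd.DataFrame:
    """Hypothetical asset standing in for a real PUDL table."""
    return pd.DataFrame({"capacity_mw": [100.0, 250.0]})


@asset_check(asset=core_generators)
def capacity_is_positive(core_generators: pd.DataFrame) -> AssetCheckResult:
    """Fail the check if any generator reports non-positive capacity."""
    bad_rows = int((core_generators["capacity_mw"] <= 0).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"bad_rows": bad_rows})
```

Collecting checks defined this way would then be a matter of adding a load_asset_checks_from_modules([...]) call (dagster's module-loading helper; note the plural spelling) wherever defs is assembled in etl/__init__.py.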
Force-pushed from 7157ca3 to 2e2c27f
Wow this is great! Huge progress toward a better validation setup, and yet not huge changes. I had some minor clarifying questions and requests for docstrings, but I think we should get this in as incremental progress and think about whether we want to move on to anything mentioned below:
Since we have so much new data coming into the DB right now, I think it would be helpful to create some more aspirational templates / examples of how we actually want the data validations to look going forward, so that as we add new cases, we can be doing it the Right Way. The factory wrapper for converting the old validation feels a little opaque for this purpose.
So it would be nice to have examples of both new df_checks and field_checks which are not tied to the old data validation implementation (though the underlying content could be drawn from those validations if we want).
Probably this would mean implementing at least one parametrized check class and migrating at least some of the existing data validations into that container completely, not just by reference to pudl.validate.
Running the validations in the ETL will require more compute, but I think there's likely to be a big overall performance speedup for the validations, given that the way the validations are currently organized, the same dataframes are being read out of the DB and into memory repeatedly (since the tests are grouped by type of validation, not by which dataframe they apply to). This also results in memory usage blowing up if you try and run the data validations on a machine with lots of cores with pytest-xdist.
And then of course we'll also discover violations of our validation checks immediately in the ETL instead of in the nightly builds, which means we'll fix them sooner.
IIRC by default failing asset checks don't cause the ETL to fail, do they? How will we be notified when a check fails? Do we want to set them to ERROR out on check failure? I think I saw that was a new option as of the most recent dagster release.
Are there any checks that we can't encode this way? I guess any check (like referential integrity) that involves more than one table won't work.
    asset_key: AssetKey,
    package: pudl.metadata.classes.Package,
) -> AssetChecksDefinition | None:
    """Create a dagster asset check based on the resource schema, if defined."""
Do we expect there to be cases in which a Resource does not have a Schema defined?
No, I don't think so. But I do expect there to be assets that are built without corresponding Resources - e.g. raw_*.
Ah, okay. I think the docstring could be clearer then. "if defined" is not "if there's a schema defined for this resource" but "if there's a resource defined for this asset."
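Something along these lines, perhaps - the function name and signature are taken from the diff above, and only the docstring wording is a suggestion:

```python
from dagster import AssetChecksDefinition, AssetKey

import pudl.metadata.classes


def asset_check_from_schema(
    asset_key: AssetKey,
    package: pudl.metadata.classes.Package,
) -> AssetChecksDefinition | None:
    """Create a dagster asset check for this asset, if a Resource is defined for it."""
    ...
```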
warnings.filterwarnings("ignore", category=ExperimentalWarning)
_package = pudl.metadata.classes.Package.from_resource_ids()
_asset_keys = itertools.chain.from_iterable(
    _get_keys_from_assets(asset_def) for asset_def in default_assets
)
default_asset_checks = [
    check
    for check in (
        asset_check_from_schema(asset_key, _package) for asset_key in _asset_keys
    )
    if check is not None
]
How do the default asset checks relate to the default assets / how are they associated with each other? Are they just identified with each other based on the ordering of these lists that are passed into the definition of defs? Or is there some key within the assets and the checks that matches them up?
They're associated by explicitly linking them in the asset param you pass to @asset_check.
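In other words, something like the following (a hypothetical reconstruction of how defs gets assembled - the ordering of the two lists doesn't matter, only the asset referenced by each check):

```python
from dagster import Definitions

# default_assets and default_asset_checks as built in the snippet above; dagster
# pairs each check with its asset via the check's `asset` argument, not by position.
defs = Definitions(
    assets=default_assets,
    asset_checks=default_asset_checks,
)
```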
src/pudl/metadata/classes.py (Outdated)
    df_checks: list[Callable] = []
    field_checks: dict[SnakeCase, list[Callable]] = {}
Maybe this shows up below, but where do you envision these additional (beyond the schema) table/column level checks being defined?
Sweet! Yeah, I was a bit bummed that I let the scope creep up (from Do One Asset to Do A Bunch of Assets + Make Some Ergonomic Improvements), but I think the extra work was very high bang for buck.
Agreed - I think this would be a great follow-up! We should define those directly alongside the assets, I think, as mentioned above. Since …
Yeah, sounds right.
Yes! Agreed on these benefits :)
I think they don't cause the ETL to stop running but they do cause the whole run to be marked as a "failure". I can make a bogus failing test and run …
Asset checks have additional_ins which should let us encode checks across multiple tables. I think the upshot of all this is that we need the following before we merge:
And we want to encode some validations in-line instead of in-schema as a follow-up PR. For that one, we could even rip out the "df_checks" and "field_checks" machinery in this PR, and say:
If we agree on ^, should we then rip out the df_checks and field_checks machinery in this PR pre-emptively?
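On the additional_ins point above, a hedged sketch of what a cross-table (referential-integrity-style) check might look like - the table and column names are hypothetical, and the exact parameter shape is worth double-checking against current dagster docs:

```python
import pandas as pd
from dagster import AssetCheckResult, AssetIn, asset_check


@asset_check(
    asset="core_generators",  # hypothetical target asset
    additional_ins={"core_plants": AssetIn("core_plants")},  # second table to load
)
def plant_ids_exist(
    core_generators: pd.DataFrame, core_plants: pd.DataFrame
) -> AssetCheckResult:
    """Every plant_id in the generators table should appear in the plants table."""
    orphans = set(core_generators["plant_id"]) - set(core_plants["plant_id"])
    return AssetCheckResult(
        passed=not orphans, metadata={"orphan_plant_ids": sorted(orphans)}
    )
```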
Force-pushed from 391b637 to afa314b
Made a "minimal checks" issue here: #3412 - @zaneselvans this should be ready for re-review! |
…ecks. Lots of ergo improvements to be had.
Still remaining: generate asset checks programmatically, instead of with laborious manual typing.
Force-pushed from afa314b to a4c4ab6
Overview
Relevant to #1572, but doesn't close it. Opens the possibility of closing it.
We need to integrate:
Fortunately, there's already integration between pandera and dagster. We just need a shim layer between our existing schema definition system and pandera - hence, this PR.
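As a rough illustration of what that shim amounts to (not the PR's actual code - the type mapping and helper name here are hypothetical):

```python
import pandera as pa

# Hypothetical mapping from simplified PUDL field types to pandas dtype strings.
FIELD_DTYPES = {"integer": "Int64", "number": "float64", "string": "string"}


def pandera_schema_from_fields(fields: dict[str, str]) -> pa.DataFrameSchema:
    """Build a pandera schema from simplified {column_name: field_type} metadata."""
    return pa.DataFrameSchema(
        {
            name: pa.Column(FIELD_DTYPES[field_type], nullable=True)
            for name, field_type in fields.items()
        }
    )


# Stripped-down stand-in for what a real Resource schema would provide:
schema = pandera_schema_from_fields(
    {"plant_id_eia": "integer", "plant_name_eia": "string"}
)
```

The real version would pull dtypes and constraints from the existing Field/Resource metadata classes rather than a hand-written dict.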
Changes:
See details for the original rant, but I ended up introducing some classes to define our existing RESOURCE_METADATA data structure. That lays the groundwork for moving towards using the frictionless library in the future, as well as further refactoring to make our metadata handling interoperate with various different libraries like we discussed in https://github.com/orgs/catalyst-cooperative/discussions/2546.

Our existing schema definition system is hard to change for many reasons; one of which is that the shape of the data in RESOURCE_METADATA is actually not documented anywhere! Instead we have all sorts of weird logic scattered throughout to turn that undocumented shape into something resembling Frictionless Packages/Resources/Fields (but that are distinct classes...)

One thing we could do to make this whole thing a bit more tractable is to actually define some sort of ResourceMetadata or ResourceSpec class (and, potentially, companion SchemaSpec, FieldSpec classes) that explicitly describes what's actually in RESOURCE_METADATA.

Right now, our Resource (and companion) classes happen to be "a copy of the frictionless class, sort of, that knows how to translate from the undefined shape of what's in RESOURCE_METADATA into our ersatz frictionless class." Which causes much entanglement/confusion.

It doesn't seem like it would be too big of a refactor to split up Resource into ResourceSpec and the official frictionless.Resource, so maybe we should do that sometime soon.

Testing
How did you make sure this worked? How can a reviewer verify this?
I ran the whole ETL, and saw that all the asset checks passed. I also happened to pick an asset which has some validation checks we expect to fail - so I saw that failing checks did actually show up in Dagster, before then adding machinery to xfail certain checks.
To-do list