
Data validation project design conversation #3971

Closed · 3 of 7 tasks · jdangerx opened this issue Nov 19, 2024 · 11 comments · Fixed by #4011

@jdangerx (Member) commented Nov 19, 2024

Overview

We need a plan before we jump into this data validation project. The plan has to fit roughly into 75 hours, less the design time, and it needs to take review / adjustments into account!

Success criteria:

  • short document that lays out the product design of the system: what specific use cases are we solving for? what hard things are we not going to do? how does the user (us, as developers) interact with the new data validation library?
  • short document laying out technical design - what components come together to present the high level interface from the product design? how do they interact with each other / what operations do they expose? what data structures do they exchange?

Next steps:

  • get people to complain about the data validation system
  • decide who needs to "approve" the product design
  • synthesize the complaints into the product design
  • get approval
  • figure out approvers, work out technical design, get approval there too

At the very minimum we should pick @zaneselvans's brain about his complaints before he leaves, and ideally we should get some input from him during the product design synthesis.

@jdangerx jdangerx self-assigned this Nov 19, 2024
@jdangerx jdangerx converted this from a draft issue Nov 19, 2024
@jdangerx jdangerx assigned zschira and unassigned jdangerx Nov 19, 2024
@jdangerx jdangerx moved this from Backlog to In review in Catalyst Megaproject Dec 17, 2024
@zschira (Member) commented Jan 7, 2025

Background:

I was moving forward with the design/prototyping of our validation framework based on Great Expectations when I discovered that there are dependency conflicts with PUDL stemming from numpy/pandas.

Possible solutions:

Downgrade dependencies

We could downgrade several dependencies to make great expectations installable. The following dependencies would need to be downgraded:

  • pandas < 2.2
  • numpy < 2.0
  • splink < 4

Use an alternate tool

Soda

I was initially hesitant about Soda because it seemed really coupled to its paid ecosystem. However, after spending more time with other tools and diving deeper into the Soda docs, I think there's a decent amount of functionality we could get out of the core library.

Pros:

  • duckdb support
  • Checks are highly configurable
  • Good support for custom sql checks

Cons:

  • Not available on conda-forge. We could probably help them contribute a feedstock, but this would be a blocker to adding soda as a dependency

Pandera

I'm including pandera since it's a tool we already use and have looked at before, but I don't think it's particularly well suited to our data validation needs. (A quick sketch of its API follows the pros/cons below.)

Pros:

  • Really nice expressive API
  • Useful integration with dagster
  • Supports custom checks

Cons:

  • Limited support for checks involving multiple columns
  • Little control over when/how checks are executed
  • No support for multi-table checks
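For context, here's a minimal sketch of the pandera API. The table and column names are illustrative, not our real schemas; it shows the expressive per-column checks, and how cross-column logic ends up as ad hoc dataframe-level lambdas.

```python
import pandas as pd
import pandera as pa

# Illustrative schema -- column names are hypothetical, not our real tables.
plant_schema = pa.DataFrameSchema(
    columns={
        "plant_id": pa.Column(int, pa.Check.ge(0)),
        "capacity_mw": pa.Column(float, pa.Check.in_range(0, 10_000)),
        "summer_capacity_mw": pa.Column(float, pa.Check.ge(0)),
    },
    # Cross-column logic lives in dataframe-level checks like this lambda;
    # there's no equivalent mechanism that spans multiple tables.
    checks=pa.Check(
        lambda df: df["capacity_mw"] >= df["summer_capacity_mw"],
        error="summer capacity cannot exceed nameplate capacity",
    ),
)

df = pd.DataFrame(
    {
        "plant_id": [1, 2],
        "capacity_mw": [100.0, 250.5],
        "summer_capacity_mw": [90.0, 240.0],
    }
)
plant_schema.validate(df)  # raises a SchemaError on failure
```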

Create a separate environment for data validation

We could develop a separate virtual environment for data validation to circumvent the dependency conflicts. While this seems to be fairly akin to how many data validation tools expect to be used, I don't think it would be compatible with our desires for this framework. For one, a major desire of ours has been rapid feedback, which would be hard to achieve if we end up with separate environments.

Develop a fully custom framework

This option sounds bad, but after spending more time with some of these tools, I don't think it's the worst option. We will most likely be developing our own high-level API anyway to integrate whatever tool we pick with Dagster nicely, and we could convert the existing validation functions we have to use this API. Also, a lot of the features provided by Great Expectations and Soda are tailored towards their hosted versions, which we probably won't use anyway. That being said, we would lose the collection of predefined validations available in these libraries, which I think is fairly valuable on its own and warrants attempting to get one of these libraries working for us.
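To make the shape of that concrete, here's a very rough sketch of what a minimal bespoke framework might look like. All of these names are hypothetical; nothing like this exists in PUDL today.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class CheckResult:
    name: str
    passed: bool
    description: str = ""

# Registry mapping output table names to their validation functions.
CHECKS: dict[str, list[Callable[[pd.DataFrame], CheckResult]]] = {}

def validates(table: str):
    """Register a validation function for a given output table."""
    def decorator(func: Callable[[pd.DataFrame], CheckResult]):
        CHECKS.setdefault(table, []).append(func)
        return func
    return decorator

@validates("out_eia__yearly_generators")  # hypothetical table name
def capacity_is_positive(df: pd.DataFrame) -> CheckResult:
    return CheckResult(
        name="capacity_is_positive",
        passed=bool((df["capacity_mw"] > 0).all()),
    )

# A thin factory could then wrap each registered function in a Dagster
# asset check so results show up next to the assets they validate.
```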

My feelings

After a deeper dive into Soda, I think it offers a lot of what we would get out of Great Expectations, but with the added benefit of duckdb support. I also think we could develop a very similar API to what I've been outlining/prototyping with Great Expectations, with Soda under the hood. Given all of this, I would propose that we look into getting a Soda feedstock into conda-forge and go forward from there.

If there are blockers to getting that done, downgrading dependencies and moving forward with Great Expectations is not ideal, but a reasonable option. Hopefully, with some pressure, we might also get them to move forward with pandas 2.2 support in the near future.

@e-belfer (Member) commented Jan 7, 2025

@zschira Are we confident that none of our other packages require pandas > 2.2?

@zaneselvans (Member) commented

Downgrading / GX dependencies

Unless GX is getting abandoned, it seems like they will absolutely have to update their dependencies. But pandas 2.2 and NumPy 2.0 have been out for quite a while now (1 year and 7 months respectively), so it's surprising and a little disturbing that this hasn't already happened, and it makes me wonder what the GX maintenance & development situation is like. If we really like what GX offers, would it be a crazy lift to try and help them migrate to newer versions? There weren't any horrendous issues when we made the switch.

Also a bummer that they explicitly don't want to add DuckDB support at this point.

@e-belfer I don't know if any of our dependencies have hard requirements for pandas>=2.2 / numpy>=2, but I think PUDL itself might (which suggests our dependencies might too), since there were breaking API changes that we had to accommodate. So I'm not sure it's easy to have a codebase that's happy with both the earlier and later versions.

Separate environment

I think this would mean we can't use the data validation framework in our asset checks, which is the main way we're hoping to get fast & early feedback, so this seems like a non-starter.

Soda

It looks like there are a bunch of different python packages hiding inside the repo, one for each of their integrations, which would be a more complex packaging setup than normal for conda-forge, but definitely doable. Though I note they don't seem to have put much effort into their existing packaging -- the PyPI package has basically no metadata. They're using a hard pin to DuckDB 1.0.0 and deprecated setup.py packaging for all their sub-packages, which I think might need to be updated to make conda-forge happy.

It wasn't immediately clear to me which parts of the Soda ecosystem were available through soda-core. Like can we create user-defined checks? Or would those have to be done in our own bespoke setup? The anomaly detection and distribution checks looked like they were only part of Soda Scientific. Oh yikes, it also requires even older versions of pandas, pydantic, and numpy than GX.

Custom

Being able to benefit from someone else's work developing and also testing data validation checks seems like a big benefit to me, and presumably the suite of checks would improve/expand over time. Though maybe not if all the new goodies end up behind the paywall of an open-core project. Also not having to manage the system for specifying tests and parsing those specifications sounds very nice.

dbt?

Did you explore using dbt and its data testing functionality at all? I think it's got deep integration with Dagster and a pretty big library of core and community-contributed data tests that we could use, which (like Soda's) are specified using YAML. The downside here is that we would need to code any novel test cases in SQL. Or I guess we could have a hybrid arrangement, where anything that's covered by existing dbt tests uses dbt, and we code our own asset checks outside of that framework using Python. I believe dbt also has a DuckDB driver. But this might also be adding too much additional infrastructure if we're just using it for testing.

@zschira (Member) commented Jan 7, 2025

@e-belfer it looks like soda-scientific, which adds some additional types of checks, requires pandas < 2.2, but otherwise I don't think any of the other options have that limitation.

@zaneselvans I have looked at dbt, and it felt like a lot of overhead to set up just for the testing options. I also couldn't quite figure out whether or not the dagster integration will work with python assets. However, given that nothing else has quite panned out as hoped, I think it's probably worth another look. I'll do some further research before our meeting tomorrow.

@e-belfer (Member) commented Jan 7, 2025

Ah sorry, I meant any of the packages in PUDL currently, but seems like Zane addressed this above!

@jdangerx (Member, Author) commented Jan 7, 2025

Thanks for all the digging @zschira and @zaneselvans !

tl;dr: I think dbt is worth trying out because it's got the best OSS ecosystem, there's a decent library of validations already out there, and it has good integration with Dagster. It seems like it will save us time and heartache vs. writing a custom framework from scratch, and get us useful new validations quickly, while still leaving space for custom additions in the future.


I think hitching our dependency wagon to GX is pretty risky. If they don't move in a timely way, we'd have to port our whole data validation suite to a new tool if we want to upgrade for any reason.

I think Soda and dbt are both pretty reasonable, though the murkiness of the Soda docs around what's really supported in the OSS vs. paid versions has me a little bit suspicious.

dbt seems the most well-maintained / the open-source darling. It seems slightly awkward to run data tests on our python assets that aren't persisted, but anything that gets persisted as parquet or sqlite should be pretty easy to define as a 'source', which we can define tests for. Those tests should get pulled into Dagster automatically, which is nice.

For custom tests we could either write them as SQL and put them in dbt, or continue to write custom Python code and shove it into manually defined @asset_check functions, which also seems fine (a rough sketch of that route is below).
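For reference, the manual route looks roughly like this. The asset and column names are illustrative, but this is the standard Dagster @asset_check pattern:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check

@asset
def out_eia__yearly_generators() -> pd.DataFrame:  # illustrative asset
    return pd.DataFrame({"capacity_mw": [100.0, 250.5]})

@asset_check(asset=out_eia__yearly_generators)
def capacity_mw_is_positive(
    out_eia__yearly_generators: pd.DataFrame,
) -> AssetCheckResult:
    # The check receives the asset's value and reports pass/fail plus metadata.
    bad_rows = int((out_eia__yearly_generators["capacity_mw"] <= 0).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"num_bad_rows": bad_rows})
```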

Adding dbt opens up possibilities for us to use it for transformations as well, down the line - I'm not sure how I feel about that. On one hand, adding more functionality than we need is asking for trouble; on the other hand, I have some hazy vision of "duckdb + dbt + dagster's now-quite-powerful partition management" as a possible direction for scaling our ETL into the future.

@zaneselvans (Member) commented

I'm also both intrigued by and afraid of the possibility of having dbt based transformations available. I think we'll always need some Python/dataframe based transforms, but dbt and Dagster both admit at this point that pure SQL isn't enough to cover all use cases.

Given DuckDB's ability to query dataframes directly as well as Parquet files and DBs, I wonder if dbt + DuckDB + pickled dataframes from Dagster could actually work fine?
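DuckDB's replacement scans make that pattern pretty low-friction -- it will resolve a bare table name against a dataframe in local Python scope, e.g.:

```python
import duckdb
import pandas as pd

# A dataframe as it might come out of a Dagster IO manager.
gens = pd.DataFrame({"plant_id": [1, 1, 2], "capacity_mw": [50.0, 60.0, 75.0]})

# DuckDB finds `gens` in the local scope and queries it like a table.
result = duckdb.sql(
    "SELECT plant_id, sum(capacity_mw) AS total_mw FROM gens GROUP BY plant_id"
).df()
print(result)
```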

Unfortunately I see now that the dbt-expectations library is no longer actively maintained as of late last year. I wonder why, and if anyone else will pick it up? I know it has some of the statistical tests that we want in it, so maybe we would just vendor them into our codebase as custom tests.

@zschira (Member) commented Jan 8, 2025

Did some more digging on enabling dbt tests within our ETL:

Overview

dbt transformations are defined in what they call models. To interact with assets generated in python you would use sources. Sources tell dbt how to load data which is produced/transformed externally (like our existing python assets). The dagster-dbt integration will automatically figure out the dependency links between sources (normal python assets) and models managed by dbt (sketched below). In dbt you can define tests on both sources and models; however, dagster does not currently support loading source tests as asset_checks. This is discussed in this issue. This is unfortunate because source tests are the primary functionality we are looking for.
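For reference, wiring the dbt models into Dagster follows the documented dagster-dbt pattern, roughly like this (the project path is illustrative):

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# @dbt_assets reads the dbt manifest and creates one Dagster asset per
# model; source() references show up as dependencies on upstream assets.
@dbt_assets(manifest=Path("dbt_project/target/manifest.json"))
def pudl_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```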

Possible setups

Use dbt as a standalone testing tool (for now)

Even though dagster does not support importing source tests, we could still set up a dbt project with sources and tests and just run it in a standalone way for testing. This doesn't give us the rapid feedback during development that we get from using asset_checks, but the issue I linked above about enabling source tests within dagster has had a good amount of activity, so it's possible this functionality could be coming down the road.

Add dbt models that mirror our python assets

We could also add models that just SELECT all of the rows from our existing python assets, then define tests on these models. This would allow us to run tests as asset_checks; however, there are several drawbacks to this approach. First, there would be a lot of boilerplate required to create these models. We might be able to template/autogenerate some of this boilerplate (see the sketch after this paragraph), but that could be error prone. The second drawback is that we would then have these 'mirror' assets hanging off the end of our DAG, and you would have to run those assets to actually run the asset_checks.
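The autogeneration could be as simple as a script that writes one-line models, something like this sketch (paths and table names are hypothetical):

```python
from pathlib import Path

# Hypothetical generator for "mirror" models: each model just re-selects
# a dbt source so that tests can attach to a dbt-managed model.
TABLES = ["core_eia__entity_plants", "out_eia__yearly_generators"]

models_dir = Path("dbt_project/models/mirrors")
models_dir.mkdir(parents=True, exist_ok=True)
for table in TABLES:
    sql = f"select * from {{{{ source('pudl', '{table}') }}}}\n"
    (models_dir / f"{table}.sql").write_text(sql)
```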

Run dbt CLI in asset_checks

We could set up a standalone dbt project like the first option listed, then try to run the CLI in asset_checks to execute tests. This seems possible, but pretty weird, and it would require a lot of bespoke logic to handle building the project, executing the CLI, and parsing the results.

@jdangerx (Member, Author) commented Jan 8, 2025

Seems like if we set up dbt with just a bunch of sources pointing at our assets (annoying duplication, but OK) and use DbtCliResource.cli(...) in an asset check, that could work? Something like how we read the RESOURCES_METADATA and generate a bunch of pandera checks now, but we'd be parsing the dbt test files and generating a bunch of dbt test calls... (rough sketch below)
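Something like this sketch, maybe -- the source/selector names are made up, and the resource wiring and raise_on_error behavior would need to be verified against dagster-dbt:

```python
from dagster import AssetCheckResult, asset_check
from dagster_dbt import DbtCliResource

# Sketch only: assumes a dbt source `pudl.out_eia__yearly_generators`
# with tests defined on it, and a DbtCliResource configured on the repo.
@asset_check(asset="out_eia__yearly_generators")
def dbt_source_tests(dbt: DbtCliResource) -> AssetCheckResult:
    invocation = dbt.cli(
        ["test", "--select", "source:pudl.out_eia__yearly_generators"],
        raise_on_error=False,
    ).wait()
    return AssetCheckResult(passed=invocation.is_successful())
```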

@zaneselvans (Member) commented

It seems like in many of these options we're running into modest shortcomings in another open source project which, if fixed, would provide value to lots of users, and which I suspect would be less work (and more likely to be shared work) than rolling our own system. So I'm inclined to explore how we might coordinate making the existing tools work the way we need them to:

  • Implement Dagster + dbt asset checks for dbt sources.
  • Update GX dependencies and add/maintain a DuckDB connector for GX
  • Modernize Soda packaging and get it up on conda-forge

Any of those options would hopefully:

  • Let us use fresh versions of our dependencies
  • Allow the use of DuckDB for querying assets and limiting peak memory usage
  • Ensure that we can use asset checks in a straightforward way within our DAG for rapid feedback
  • Allow us to take advantage of existing suites of data validation tests

I'm least familiar with Soda, and most nervous about them from an openness point of view, so my gut intuition is to try for the GX or dbt options.

  • Could we interest the DuckDB developers / MotherDuck folks in contributing to or helping to maintain the GX connector?
  • What other users / constituencies would be interested in the Dagster + DuckDB + dbt source checks? How could we move it up their list of priorities?

@zaneselvans (Member) commented

We (@zschira @jdangerx and I) had a call discussing this, and decided to prototype a few existing tests using dbt to see if there are any issues. If it works, this option has the following advantages:

  • A big existing user base / institutional support
  • Lots of existing data validation tests, even if we need to vendor / fork dbt-expectations (and it seems plausible that someone else will pick up ongoing maintenance of those tests; they also seem unlikely to break, given how stable SQL and dbt are).
  • Would allow us to start migrating existing data validations immediately, and figure out how to get the dbt data tests running inside asset checks in parallel (with an asset check factory as a shim, using the CLI API, or an upstream contribution to Dagster)
  • One of our new member candidates is already familiar with dbt
  • dbt is already deeply integrated with Dagster and seems likely to remain that way
  • Good DuckDB support.
  • dbt is extremely flexible in allowing novel user-defined data tests with SQL, but using existing tests doesn't require any knowledge of SQL -- it's all specified using YAML.
  • The Dagster folks seem very open to getting dbt sources working in asset checks, it just isn't something they've prioritized so far.

@zschira zschira mentioned this issue Jan 13, 2025
@jdangerx jdangerx linked a pull request Feb 10, 2025 that will close this issue
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Feb 22, 2025