
Data validation project design conversation #3971

Closed · 3 of 7 tasks · jdangerx opened this issue Nov 19, 2024 · 11 comments · Fixed by #4011

@jdangerx (Member) commented Nov 19, 2024

Overview

We need a plan before we jump into this data validation project. The plan has to fit roughly into 75 hours, less the design time, and it needs to take review / adjustments into account!

Success criteria:

  • short document that lays out the product design of the system: what specific use cases are we solving for? what hard things are we not going to do? how does the user (us, as developers) interact with the new data validation library?
  • short document laying out technical design - what components come together to present the high level interface from the product design? how do they interact with each other / what operations do they expose? what data structures do they exchange?

Next steps:

  • get people to complain about the data validation system
  • decide who needs to "approve" the product design
  • synthesize the complaints into the product design
  • get approval
  • figure out approvers, work out technical design, get approval there too

At the very minimum we should pick @zaneselvans's brain about his complaints before he leaves, and ideally we should get some input from him during the product design synthesis.

@jdangerx jdangerx self-assigned this Nov 19, 2024
@jdangerx jdangerx converted this from a draft issue Nov 19, 2024
@jdangerx jdangerx assigned zschira and unassigned jdangerx Nov 19, 2024
@jdangerx jdangerx moved this from Backlog to In review in Catalyst Megaproject Dec 17, 2024
@zschira (Member) commented Jan 7, 2025

Background:

I was moving forward with the design/prototyping of our validation framework based on Great Expectations when I discovered that there are dependency conflicts with PUDL stemming from numpy/pandas.

Possible solutions:

Downgrade dependencies

We could downgrade several dependencies to make great expectations installable. The following dependencies would need to be downgraded:

  • pandas < 2.2
  • numpy < 2.0
  • splink < 4

Use an alternate tool

Soda

I was initially hesitant about Soda because it seemed really coupled to its paid ecosystem. However, after spending more time with other tools and diving deeper into the Soda docs, I think there's a decent amount of functionality we could get out of the core library.

Pros:

  • duckdb support
  • Checks are highly configurable
  • Good support for custom sql checks

Cons:

  • Not available on conda-forge. We could probably help them contribute a feedstock, but this would be a blocker to adding soda as a dependency

Pandera

I'm including pandera since it's a tool we already use and have looked at before, but I don't think it's particularly well suited to our data validation needs. (A quick sketch of its API follows the pros/cons below.)

Pros:

  • Really nice expressive API
  • Useful integration with dagster
  • Supports custom checks

Cons:

  • Limited support for checks involving multiple columns
  • Little control over when/how checks are executed
  • No support for multi-table checks
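For context, here's a minimal sketch of the pandera API. The table and column names are illustrative, not our real schemas; it shows the expressive per-column checks, and how cross-column logic ends up as ad hoc dataframe-level lambdas.

```python
import pandas as pd
import pandera as pa

# Illustrative schema -- column names are hypothetical, not our real tables.
plant_schema = pa.DataFrameSchema(
    columns={
        "plant_id": pa.Column(int, pa.Check.ge(0)),
        "capacity_mw": pa.Column(float, pa.Check.in_range(0, 10_000)),
        "summer_capacity_mw": pa.Column(float, pa.Check.ge(0)),
    },
    # Cross-column logic lives in dataframe-level checks like this lambda;
    # there's no equivalent mechanism that spans multiple tables.
    checks=pa.Check(
        lambda df: df["capacity_mw"] >= df["summer_capacity_mw"],
        error="summer capacity cannot exceed nameplate capacity",
    ),
)

df = pd.DataFrame(
    {
        "plant_id": [1, 2],
        "capacity_mw": [100.0, 250.5],
        "summer_capacity_mw": [90.0, 240.0],
    }
)
plant_schema.validate(df)  # raises a SchemaError on failure
```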

Create a separate environment for data validation

We could develop a separate virtual environment for data validation to circumvent the dependency conflicts. While this seems to be fairly akin to how many data validation tools expect to be used, I don't think it would be compatible with our desires for this framework. For one, a major desire of ours has been rapid feedback, which would be hard to achieve if we end up with separate environments.

Develop a fully custom framework

This option sounds bad, but after spending more time with some of these tools, I don't think it's the worst option. We will most likely be developing our own high-level API anyway to integrate whatever tool we pick with Dagster nicely, and we could convert the existing validation functions we have to use this API. Also, a lot of the features provided by Great Expectations and Soda are tailored towards their hosted versions, which we probably won't use anyway. That being said, we would lose the collection of predefined validations available in these libraries, which I think is fairly valuable on its own and warrants attempting to get one of these libraries working for us.
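To make the shape of that concrete, here's a very rough sketch of what a minimal bespoke framework might look like. All of these names are hypothetical; nothing like this exists in PUDL today.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class CheckResult:
    name: str
    passed: bool
    description: str = ""

# Registry mapping output table names to their validation functions.
CHECKS: dict[str, list[Callable[[pd.DataFrame], CheckResult]]] = {}

def validates(table: str):
    """Register a validation function for a given output table."""
    def decorator(func: Callable[[pd.DataFrame], CheckResult]):
        CHECKS.setdefault(table, []).append(func)
        return func
    return decorator

@validates("out_eia__yearly_generators")  # hypothetical table name
def capacity_is_positive(df: pd.DataFrame) -> CheckResult:
    return CheckResult(
        name="capacity_is_positive",
        passed=bool((df["capacity_mw"] > 0).all()),
    )

# A thin factory could then wrap each registered function in a Dagster
# asset check so results show up next to the assets they validate.
```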

My feelings

After a deeper dive into Soda, I think it offers a lot of what we would get out of Great Expectations, but with the added benefit of duckdb support. I also think we could develop a very similar API to what I've been outlining/prototyping with Great Expectations, with Soda under the hood. Given all of this, I would propose that we look into getting a Soda feedstock into conda-forge and go forward from there.

If there are blockers to getting that done, downgrading dependencies and moving forward with Great Expectations is not ideal, but a reasonable option. Hopefully, with some pressure, we might also get them to move forward with pandas 2.2 support in the near future.

@e-belfer (Member) commented Jan 7, 2025

@zschira Are we confident that none of our other packages require pandas > 2.2?

@zaneselvans (Member) commented

Downgrading / GX dependencies

Unless GX is getting abandoned, it seems like they will absolutely have to update their dependencies. But pandas 2.2 and NumPy 2.0 have been out for quite a while now (1 year and 7 months respectively), so it's surprising and a little disturbing that this hasn't already happened, and it makes me wonder what the GX maintenance & development situation is like. If we really like what GX offers, would it be a crazy lift to try and help them migrate to newer versions? There weren't any horrendous issues when we made the switch.

Also a bummer that they explicitly don't want to add DuckDB support at this point.

@e-belfer I don't know if any of our dependencies have hard requirements for pandas>=2.2 / numpy>=2, but I think PUDL itself might (which suggests our dependencies might too), since there were breaking API changes that we had to accommodate. So I'm not sure it's easy to have a codebase that's happy with both the earlier and later versions.

Separate environment

I think this would mean we can't use the data validation framework in our asset checks, which is the main way we're hoping to get fast & early feedback, so this seems like a non-starter.

Soda

It looks like there are a bunch of different python packages hiding inside the repo, one for each of their integrations, which would be a more complex packaging setup than normal for conda-forge, but definitely doable. Though I note they don't seem to have put much effort into their existing packaging -- the PyPI package has basically no metadata. They're using a hard pin to DuckDB 1.0.0 and deprecated setup.py packaging for all their sub-packages, which I think might need to be updated to make conda-forge happy.

It wasn't immediately clear to me which parts of the Soda ecosystem were available through soda-core. Like can we create user-defined checks? Or would those have to be done in our own bespoke setup? The anomaly detection and distribution checks looked like they were only part of Soda Scientific. Oh yikes, it also requires even older versions of pandas, pydantic, and numpy than GX.

Custom

Being able to benefit from someone else's work developing and also testing data validation checks seems like a big benefit to me, and presumably the suite of checks would improve/expand over time. Though maybe not if all the new goodies end up behind the paywall of an open-core project. Also not having to manage the system for specifying tests and parsing those specifications sounds very nice.

dbt?

Did you explore using dbt and its data testing functionality at all? I think it's got deep integration with Dagster and a pretty big library of core and community-contributed data tests that we could use, which (like Soda's) are specified using YAML. The downside here is that we would need to code any novel test cases in SQL. Or I guess we could have a hybrid arrangement, where anything that's covered by existing dbt tests uses dbt, and we code our own asset checks outside of that framework using Python. I believe dbt also has a DuckDB driver. But this might also be adding too much additional infrastructure if we're just using it for testing.

@zschira (Member) commented Jan 7, 2025

@e-belfer it looks like soda-scientific, which adds some additional types of checks, requires pandas < 2.2, but otherwise I don't think any of the other options have that limitation.

@zaneselvans I have looked at dbt, and it felt like a lot of overhead to set up just for the testing options. I also couldn't quite figure out whether or not the dagster integration will work with python assets. However, given that nothing else has quite panned out as hoped, I think it's probably worth another look. I'll do some further research before our meeting tomorrow.

@e-belfer (Member) commented Jan 7, 2025

Ah sorry, I meant any of the packages in PUDL currently, but seems like Zane addressed this above!

@jdangerx (Member, Author) commented Jan 7, 2025

Thanks for all the digging @zschira and @zaneselvans !

tl;dr: I think dbt is worth trying out because it's got the best OSS ecosystem, there's a decent library of validations already out there, and it has good integration with Dagster. It seems like it will save us time and heartache vs. writing a custom framework from scratch, and get us useful new validations quickly, while still leaving space for custom additions in the future.


I think hitching our dependency wagon to GX is pretty risky. If they don't move in a timely way, we'd have to port our whole data validation suite to a new tool if we want to upgrade for any reason.

I think Soda and dbt are both pretty reasonable, though the murkiness of the Soda docs around what's really supported in the OSS vs. paid versions has me a little bit suspicious.

dbt seems the most well-maintained / the open-source darling. It seems slightly awkward to run data tests on our python assets that aren't persisted, but anything that gets persisted as parquet or sqlite should be pretty easy to define as a 'source', which we can define tests for. Those tests should get pulled into Dagster automatically, which is nice.

For custom tests we could either write them as SQL and put them in dbt, or continue to write custom Python code and shove it into manually defined @asset_check functions, which also seems fine (a rough sketch of that route is below).
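For reference, the manual route looks roughly like this. The asset and column names are illustrative, but this is the standard Dagster @asset_check pattern:

```python
import pandas as pd
from dagster import AssetCheckResult, asset, asset_check

@asset
def out_eia__yearly_generators() -> pd.DataFrame:  # illustrative asset
    return pd.DataFrame({"capacity_mw": [100.0, 250.5]})

@asset_check(asset=out_eia__yearly_generators)
def capacity_mw_is_positive(
    out_eia__yearly_generators: pd.DataFrame,
) -> AssetCheckResult:
    # The check receives the asset's value and reports pass/fail plus metadata.
    bad_rows = int((out_eia__yearly_generators["capacity_mw"] <= 0).sum())
    return AssetCheckResult(passed=bad_rows == 0, metadata={"num_bad_rows": bad_rows})
```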

Adding dbt opens up possibilities for us to use it for transformations as well, down the line - I'm not sure how I feel about that. On one hand, adding more functionality than we need is asking for trouble; on the other hand, I have some hazy vision of "duckdb + dbt + dagster's now-quite-powerful partition management" as a possible direction for scaling our ETL into the future.

@zaneselvans (Member) commented

I'm also both intrigued by and afraid of the possibility of having dbt based transformations available. I think we'll always need some Python/dataframe based transforms, but dbt and Dagster both admit at this point that pure SQL isn't enough to cover all use cases.

Given DuckDB's ability to query dataframes directly as well as Parquet files and DBs, I wonder if dbt + DuckDB + pickled dataframes from Dagster could actually work fine?
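DuckDB's replacement scans make that pattern pretty low-friction -- it will resolve a bare table name against a dataframe in local Python scope, e.g.:

```python
import duckdb
import pandas as pd

# A dataframe as it might come out of a Dagster IO manager.
gens = pd.DataFrame({"plant_id": [1, 1, 2], "capacity_mw": [50.0, 60.0, 75.0]})

# DuckDB finds `gens` in the local scope and queries it like a table.
result = duckdb.sql(
    "SELECT plant_id, sum(capacity_mw) AS total_mw FROM gens GROUP BY plant_id"
).df()
print(result)
```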

Unfortunately I see now that the dbt-expectations library is no longer actively maintained as of late last year. I wonder why, and if anyone else will pick it up? I know it has some of the statistical tests that we want in it, so maybe we would just vendor them into our codebase as custom tests.

@zschira (Member) commented Jan 8, 2025

Did some more digging on enabling dbt tests within our ETL:

Overview

dbt transformations are defined in what they call models. To interact with assets generated in python you would use sources. Sources tell dbt how to load data which is produced/transformed externally (like our existing python assets). The dagster-dbt integration will automatically figure out the dependency links between sources (normal python assets) and models managed by dbt (sketched below). In dbt you can define tests on both sources and models; however, dagster does not currently support loading source tests as asset_checks. This is discussed in this issue. This is unfortunate because source tests are the primary functionality we are looking for.
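For reference, wiring the dbt models into Dagster follows the documented dagster-dbt pattern, roughly like this (the project path is illustrative):

```python
from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# @dbt_assets reads the dbt manifest and creates one Dagster asset per
# model; source() references show up as dependencies on upstream assets.
@dbt_assets(manifest=Path("dbt_project/target/manifest.json"))
def pudl_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```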

Possible setups

Use dbt as a standalone testing tool (for now)

Even though dagster does not support importing source tests, we could still set up a dbt project with sources and tests and just run it in a standalone way for testing. This doesn't give us the rapid feedback during development that we get from using asset_checks, but the issue I linked above about enabling source tests within dagster has had a good amount of activity, so it's possible this functionality could be coming down the road.

Add dbt models that mirror our python assets

We could also add models that just SELECT all of the rows from our existing python assets, then define tests on these models. This would allow us to run tests as asset_checks; however, there are several drawbacks to this approach. First, there would be a lot of boilerplate required to create these models. We might be able to template/autogenerate some of this boilerplate (see the sketch after this paragraph), but that could be error prone. The second drawback is that we would then have these 'mirror' assets hanging off the end of our DAG, and you would have to run those assets to actually run the asset_checks.
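The autogeneration could be as simple as a script that writes one-line models, something like this sketch (paths and table names are hypothetical):

```python
from pathlib import Path

# Hypothetical generator for "mirror" models: each model just re-selects
# a dbt source so that tests can attach to a dbt-managed model.
TABLES = ["core_eia__entity_plants", "out_eia__yearly_generators"]

models_dir = Path("dbt_project/models/mirrors")
models_dir.mkdir(parents=True, exist_ok=True)
for table in TABLES:
    sql = f"select * from {{{{ source('pudl', '{table}') }}}}\n"
    (models_dir / f"{table}.sql").write_text(sql)
```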

Run dbt CLI in asset_checks

We could set up a standalone dbt project like the first option listed, then try to run the CLI in asset_checks to execute tests. This seems possible, but pretty weird, and it would require a lot of bespoke logic to handle building the project, executing the CLI, and parsing the results.

@jdangerx (Member, Author) commented Jan 8, 2025

Seems like if we set up dbt with just a bunch of sources pointing at our assets (annoying duplication, but OK) and use DbtCliResource.cli(...) in an asset check, that could work? Something like how we read the RESOURCES_METADATA and generate a bunch of pandera checks now, but we'd be parsing the dbt test files and generating a bunch of dbt test calls... (rough sketch below)
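Something like this sketch, maybe -- the source/selector names are made up, and the resource wiring and raise_on_error behavior would need to be verified against dagster-dbt:

```python
from dagster import AssetCheckResult, asset_check
from dagster_dbt import DbtCliResource

# Sketch only: assumes a dbt source `pudl.out_eia__yearly_generators`
# with tests defined on it, and a DbtCliResource configured on the repo.
@asset_check(asset="out_eia__yearly_generators")
def dbt_source_tests(dbt: DbtCliResource) -> AssetCheckResult:
    invocation = dbt.cli(
        ["test", "--select", "source:pudl.out_eia__yearly_generators"],
        raise_on_error=False,
    ).wait()
    return AssetCheckResult(passed=invocation.is_successful())
```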

@zaneselvans (Member) commented

It seems like in many of these options we're running into modest shortcomings in another open source project which, if fixed, would provide value to lots of users, and which I suspect would be less work (and more likely to be shared work) than rolling our own system. So I'm inclined to explore how we might coordinate making the existing tools work the way we need them to:

  • Implement Dagster + dbt asset checks for dbt sources.
  • Update GX dependencies and add/maintain a DuckDB connector for GX
  • Modernize Soda packaging and get it up on conda-forge

Any of those options would hopefully:

  • Let us use fresh versions of our dependencies
  • Allow the use of DuckDB for querying assets and limiting peak memory usage
  • Ensure that we can use asset checks in a straightforward way within our DAG for rapid feedback
  • Allow us to take advantage of existing suites of data validation tests

I'm least familiar with Soda, and most nervous about them from an openness point of view, so my gut intuition is to try for the GX or dbt options.

  • Could we interest the DuckDB developers / MotherDuck folks in contributing to or helping to maintain the GX connector?
  • What other users / constituencies would be interested in the Dagster + DuckDB + dbt source checks? How could we move it up their list of priorities?

@zaneselvans (Member) commented

We (@zschira @jdangerx and I) had a call discussing this, and decided to prototype a few existing tests using dbt to see if there are any issues. If it works, this option has the following advantages:

  • A big existing user base / institutional support
  • Lots of existing data validation tests, even if we need to vendor / fork dbt-expectations (and it seems plausible that someone else will pick up ongoing maintenance of those tests; they also seem unlikely to break, given how stable SQL and dbt are).
  • Would allow us to start migrating existing data validations immediately, and figure out how to get the dbt data tests running inside asset checks in parallel (with an asset check factory as a shim, using the CLI API, or an upstream contribution to Dagster)
  • One of our new member candidates is already familiar with dbt
  • dbt is already deeply integrated with Dagster and seems likely to remain that way
  • Good DuckDB support.
  • dbt is extremely flexible in allowing novel user-defined data tests with SQL, but using existing tests doesn't require any knowledge of SQL -- it's all specified using YAML.
  • The Dagster folks seem very open to getting dbt sources working in asset checks, it just isn't something they've prioritized so far.

@zschira zschira mentioned this issue Jan 13, 2025
@jdangerx jdangerx linked a pull request Feb 10, 2025 that will close this issue
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Feb 22, 2025