Data validation project design conversation #3971
Background:
I was moving forward with the design/prototyping of our validation framework based on Great Expectations when I discovered that there are dependency conflicts with PUDL stemming from numpy/pandas.
Possible solutions:
Downgrade dependencies
We could downgrade several dependencies to make Great Expectations installable. The following dependencies would need to be downgraded:
Use an alternate tool
Soda
I was initially hesitant about Soda because it seemed really coupled to its paid ecosystem. However, after spending more time with other tools and diving deeper into their docs, I think that there’s a decent amount of functionality that we could get out of the core library.
Pros:
Cons:
Pandera
I'm including
Pros:
Cons:
Create a separate environment for data validation
We could develop a separate virtual environment for data validation to circumvent the dependency conflicts. While this seems to be fairly akin to how many data validation tools expect to be used, I don't think it would be compatible with our desires for this framework. For one, a major desire of ours has been rapid feedback, which would be hard to achieve if we end up with separate environments.
Develop a fully custom framework
This option sounds bad, but after spending more time with some of these tools, I don't think it's the worst option. We will most likely already be developing our own high-level API to integrate any tools with Dagster nicely, and we could convert the existing validation functions we have to use this API. There's also a lot of features provided by
My feelings
After a deeper dive into If there are any blockers to getting this done, I think downgrading dependencies and moving forward with |
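To make the "fully custom framework" option a little more concrete, here is a minimal sketch of what a small high-level check API could look like. Everything in it is an assumption for illustration: the `validation_check` decorator, the `CheckResult` type, and the `plant_id` / `capacity_mw` column names are hypothetical and not part of PUDL, Dagster, or any of the tools discussed here. It uses plain Python records so it runs without pandas.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry: check name -> function returning the rows that fail.
_CHECKS: dict[str, Callable[[list[dict]], list[dict]]] = {}


@dataclass
class CheckResult:
    """Outcome of one validation check (illustrative, not a real PUDL type)."""

    name: str
    passed: bool
    failing_rows: int


def validation_check(name: str):
    """Register a function as a named validation check."""

    def decorator(func):
        _CHECKS[name] = func
        return func

    return decorator


@validation_check("no_null_ids")
def no_null_ids(rows):
    # A row fails when its (hypothetical) plant_id column is missing.
    return [r for r in rows if r.get("plant_id") is None]


@validation_check("capacity_non_negative")
def capacity_non_negative(rows):
    # A row fails when its (hypothetical) capacity_mw column is negative.
    return [
        r
        for r in rows
        if r.get("capacity_mw") is not None and r["capacity_mw"] < 0
    ]


def run_checks(rows):
    """Run every registered check and summarize the failures."""
    results = []
    for name, func in _CHECKS.items():
        failures = func(rows)
        results.append(
            CheckResult(name=name, passed=not failures, failing_rows=len(failures))
        )
    return results
```

A thin wrapper like `run_checks` is the kind of layer that could later be called from inside a Dagster asset check, but that integration is not shown here and would need to be designed separately.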
@zschira Are we confident that none of our other packages require |
Downgrading / GX dependencies
Unless GX is getting abandoned it seems like they will absolutely have to update their dependencies. But pandas 2.2 and NumPy 2.0 have been out for quite a while now (1 year and 7 months respectively) so it's surprising and a little disturbing that this hasn't already happened, and it makes me wonder what the GX maintenance & development situation is like. If we really like what GX offers, would it be a crazy lift to try and help them migrate to newer versions? There weren't any horrendous issues when we made the switch. Also a bummer that they explicitly don't want to add DuckDB support at this point.
@e-belfer I don't know if any of our dependencies have hard requirements for
Separate environment
I think this would mean we can't use the data validation framework in our asset checks, which is the main way we're hoping to get fast & early feedback, so this seems like a non-starter.
Soda
It looks like there are a bunch of different Python packages hiding inside the repo, one for each of their integrations, which would be a more complex than normal packaging setup for
It wasn't immediately clear to me which parts of the Soda ecosystem were available through soda-core. Like can we create user-defined checks? Or would those have to be done in our own bespoke setup? The anomaly detection and distribution checks looked like they were only part of Soda Scientific. Oh yikes, it also requires even older versions of pandas, pydantic, and numpy than GX.
Custom
Being able to benefit from someone else's work developing and also testing data validation checks seems like a big benefit to me, and presumably the suite of checks would improve/expand over time. Though maybe not if all the new goodies end up behind the paywall of an open-core project. Also not having to manage the system for specifying tests and parsing those specifications sounds very nice.
dbt?
Did you explore using |
@e-belfer it looks like @zaneselvans I have looked at |
Ah sorry, I meant any of the packages in PUDL currently, but seems like Zane addressed this above! |
Thanks for all the digging @zschira and @zaneselvans! tl;dr: I think I think hitching our dependency wagon to GX is pretty risky. If they don't move in a timely way, we'd have to port our whole data validation suite to a new tool if we want to upgrade for any reason. I think
For custom tests we could either write them as SQL and put them in Adding |
I'm also both intrigued by and afraid of the possibility of having Given DuckDB's ability to query dataframes directly as well as Parquet files and DBs, I wonder if Unfortunately I see now that the |
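As a rough illustration of the "write custom tests as SQL" idea discussed above, the sketch below follows the convention used by dbt-style tests, where a test is a query that selects the failing rows and passes when it returns none. It is only a sketch under stated assumptions: the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies, and the `plants` table, its columns, and the `run_sql_test` helper are all hypothetical.

```python
import sqlite3


def run_sql_test(conn, name, failing_rows_sql):
    """Run a query that selects failing rows; the test passes if none come back."""
    rows = conn.execute(failing_rows_sql).fetchall()
    return {"test": name, "passed": len(rows) == 0, "failures": rows}


# Hypothetical table standing in for a real output table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER, capacity_mw REAL)")
conn.executemany(
    "INSERT INTO plants VALUES (?, ?)",
    [(1, 10.0), (2, -3.5), (None, 7.0)],
)

# Each test is just SQL that returns the rows violating an expectation.
TESTS = {
    "capacity_non_negative": (
        "SELECT plant_id, capacity_mw FROM plants WHERE capacity_mw < 0"
    ),
    "plant_id_not_null": (
        "SELECT plant_id, capacity_mw FROM plants WHERE plant_id IS NULL"
    ),
    "capacity_below_1000": (
        "SELECT plant_id, capacity_mw FROM plants WHERE capacity_mw >= 1000"
    ),
}

results = {name: run_sql_test(conn, name, sql) for name, sql in TESTS.items()}

for name, result in results.items():
    status = "PASS" if result["passed"] else "FAIL"
    print(f"{status} {name}: {len(result['failures'])} failing row(s)")
```

Because each test is just a query string, the same shape would in principle carry over to any engine that can run SQL against the data, though how well that works with DuckDB, dataframes, or Parquet files in our setup would need to be verified.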
Did some more digging on enabling
Overview
Possible setups
Use
|
Seems like if we set up |
It seems like in many of these options we're running into modest shortcomings within another open source project. Fixing those would provide value to lots of users, and I suspect it would be less work (and more likely to be shared work) than rolling our own system, so I'm inclined to explore how we might coordinate making the existing tools work how we need them to.
Any of those options would hopefully:
I'm least familiar with Soda, and most nervous about them from an openness point of view, so my gut intuition is to try for the GX or
|
We (@zschira @jdangerx and I) had a call discussing this, and decided to prototype a few existing tests using
|
Overview
We need to have a plan before we jump into this data validation project. Plus the plan has to fit roughly into 75 hours, less the design time (and needs to take review / adjustments into account!)
Success criteria:
Next steps:
At the very minimum we should pick @zaneselvans's brain about his complaints before he leaves, and ideally we should get some input from him in the product design synthesis part.