PyDeequ support for Apache Spark 3.4.0 (and ideally 3.5.0) #192

Open
machadoluiz opened this issue Feb 29, 2024 · 30 comments
Labels: enhancement, feature request

Comments

@machadoluiz

machadoluiz commented Feb 29, 2024

Is your feature request related to a problem? Please describe.
I'm currently facing issues because PyDeequ does not support Apache Spark 3.4.0, which is impacting several projects in my organization that use PyDeequ as a data quality tool. The problem arises because our EMR clusters are required to run the latest releases, and since emr-6.12.0, support for Apache Spark 3.3.x has been dropped.

Describe the solution you'd like
I would like PyDeequ to be updated to support Apache Spark 3.4.0 and, ideally, also the most recent version, 3.5.0. I would also like to understand the requirements for this support: whether there are any backwards-compatibility requirements for PyDeequ, and whether all future PyDeequ versions need to keep supporting all of the currently supported Spark and Deequ versions, or whether there is scope for dropping support for some versions, as mentioned in #178.

Describe alternatives you've considered
As an alternative, we have considered migrating to Great Expectations due to its active maintenance and large community. However, PyDeequ is still preferred due to its seamless integration with our internal PySpark library. The transition to a new tool would also require significant resources and time. Therefore, having PyDeequ support Apache Spark 3.4.0 and 3.5.0 would be the most beneficial solution for us.

Additional context
It seems that Deequ already supports Apache Spark 3.4.0 (#505) and, most recently, 3.5.0 (#514).

@chenliu0831
Contributor

We take backward compatibility very seriously, as every AWS API or AWS-owned library does. Dropping support for EOL Spark versions could be an option, but it needs a bit more research.

I don't think it's very hard to fix #169, but the change should be made on the Deequ Scala side (adding overloaded functions with the old parameters). We currently do not have a date.

@chenliu0831
Contributor

As a workaround, you can set the env var SPARK_VERSION=3.3, and to my knowledge most PyDeequ features should continue to work. Although unlikely, there might be runtime errors from breaking changes between Spark 3.3 and 3.5.
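
For reference, a minimal sketch of that workaround, assuming PyDeequ resolves the Deequ Maven coordinate from the SPARK_VERSION environment variable at import time and exposes pydeequ.deequ_maven_coord / pydeequ.f2j_maven_coord as in the README:

```python
import os

# Set SPARK_VERSION before importing pydeequ so it selects the Deequ build for Spark 3.3.
os.environ["SPARK_VERSION"] = "3.3"

import pydeequ
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull the Deequ jar that matches SPARK_VERSION onto the JVM class path.
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    # Exclude the f2j artifact that conflicts with Spark's own linear algebra dependencies.
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

print(pydeequ.deequ_maven_coord)  # which Deequ artifact this session will use
```

Whether a Spark 3.4/3.5 runtime accepts the 3.3 jar is exactly the compatibility gamble described above, so treat this as a stopgap.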

@LucasSchelkes-BA

> As a workaround, you can set the env var SPARK_VERSION=3.3, and to my knowledge most PyDeequ features should continue to work. Although unlikely, there might be runtime errors from breaking changes between Spark 3.3 and 3.5.

I see. But is native support for higher Spark versions planned at all? If so, when is it scheduled?

@Joao-DEUS-DE

Is there a date for when this update might be expected? I am currently working on a project that uses PySpark 3.4.1 on Databricks, and I would like to use PyDeequ.

@carlacha

Hello! Just checking in to see if there's any news on when we might expect this feature to land. Any rough idea of a release date? I'm using this library in my project and need to upgrade to Spark 3.4, since we're on Databricks Runtime 13.3 LTS, and I would like to keep using it. Thanks!

@hardiktalati

Hey guys,
Any plans for the upgrade to Spark 3.4?

@hardiktalati

@chenliu0831 do you have a release date for the Spark 3.5 upgrade?

@chenliu0831
Contributor

I think we are getting very close with #203 (only 2 test failures remaining, due to a dependency issue).

@hardiktalati

@chenliu0831 how is it looking, buddy?
Can we expect a release this week?
Also, are you doing it for both 3.4 and 3.5?

@chenliu0831
Contributor

@hardiktalati I think the fix for the 2 failures will need a Deequ release; please be patient and I will post updates. I think it should solve both 3.4 and 3.5, and we may release them together.

@carlacha

Hello! Any refreshing news? I know it's complicated and we have to be patient. I'm just checking whether there is an approximate release date, because my project is blocked and I would like to keep using this. Thanks 😊

@datanikkthegreek

@chenliu0831 Also from my side, Spark 3.5 support is highly awaited 😃 I have been observing this thread for some time now.

No Spark 3.5 support would be a showstopper for using PyDeequ, and rather an argument for Great Expectations :)

Looking forward to it, and thanks for moving this topic forward.

@hardiktalati

@chenliu0831 bro, you mentioned it's nearly done. How far along is it?

@hardiktalati

hardiktalati commented Jun 6, 2024

@chenliu0831 any updates? It has been more than a month now.

@hardiktalati

@chenliu0831 I would appreciate a response; we are blocked due to the pending upgrade.

@sqlkabouter

I'm evaluating PyDeequ vs. Great Expectations, and after reading all of this, PyDeequ seems very unreliable. How can it take over a year to add support for Spark 3.4?

@hardiktalati

@chenliu0831 at least respond back so that we can make a decision.

@D2Bull

D2Bull commented Jun 23, 2024

We developed a DQ solution based on PyDeequ.
After moving to Databricks, we lost the ability to keep working with the solution.
We would appreciate an update regarding the implementation of Spark 3.5 support and official support for it (or for Delta tables) as part of PyDeequ.

@rdsharma26
Contributor

@hardiktalati @D2Bull @sqlkabouter We apologize for the inconvenience. We are actively working on the upgrade to Spark 3.4 and we aim to finish it as soon as possible. The upgrade to Spark 3.5 will follow right after.

@hardiktalati

@rdsharma26 thanks for getting back to us. Is it possible to share tentative dates so that I can communicate them back to my colleagues?

@rdsharma26
Contributor

@hardiktalati At the moment, we don't have a date to share. We are trying to root-cause the failure of two unit tests. Upgrading PyDeequ to Spark 3.4 and using Deequ's 2.0.7 Spark 3.4 library results in the following error.

py4j.protocol.Py4JJavaError: An error occurred while calling o3327.run.
java.lang.NoSuchMethodError: 'breeze.generic.UFunc$UImpl2 breeze.linalg.DenseVector$.dv_dv_Op_Double_OpDiv()'

Once the RCA is done, if a new release of Deequ is required, it could take a week until PyDeequ is fixed. If the fix is within PyDeequ itself, a new version with Spark 3.4 support can be released within a few days.

Once the Spark 3.4 support is added, we will work on Spark 3.5 next.

@rdsharma26
Contributor

We took a different approach from my previous message. It looks like we will need a new Deequ release to upgrade the Breeze dependency for Spark 3.4. In light of that, I created a PR that adds Spark 3.5 support: #210

@datanikkthegreek

@rdsharma26 Let us know once you have released a release candidate :)

By the way, is it worth supporting older Spark versions? I think the maintenance window is 18 months. I would probably cut off releases for older versions at some point, especially if there are breaking changes :)

@rdsharma26
Contributor

Spark 3.5 support has been added in https://pypi.org/project/pydeequ/1.4.0/ 🚀

@datanikkthegreek That's a great point. We did recently drop support for Spark 2.4. Spark 3.4 is still a relatively new version, so we will add support for it soon.
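
For anyone upgrading, a minimal smoke test, assuming pydeequ 1.4.0 and PySpark 3.5 are installed; the SPARK_VERSION variable and the Maven-coordinate helpers follow the README pattern:

```python
import os
os.environ["SPARK_VERSION"] = "3.5"  # match the newly supported Spark version

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

# Tiny DataFrame just to exercise the verification pipeline end to end.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["id", "value"])

check = Check(spark, CheckLevel.Error, "pydeequ 1.4.0 smoke test")
result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(check.isComplete("id").isUnique("id"))
    .run()
)

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```

Treat this as a sketch; exact session configuration will differ on managed platforms such as Databricks or EMR.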

@D2Bull

D2Bull commented Jul 3, 2024

> Spark 3.5 support has been added in https://pypi.org/project/pydeequ/1.4.0/ 🚀

What I'm missing: the latest announcement seems to be about Spark 3.3.0. Where is Spark 3.5 mentioned?

> 🎉 Announcements 🎉
> NEW!!! 1.1.0 release of Python Deequ has been published to PYPI https://pypi.org/project/pydeequ/. This release brings many recency upgrades including support up to Spark 3.3.0! Any feedbacks are welcome through github issues.

@rdsharma26
Contributor

@D2Bull The README has been updated in the master branch. The project description in PyPI will not change until the next release.

@rodrigofp-possiblefinance

rodrigofp-possiblefinance commented Jul 4, 2024

Hi folks.

I'm in the middle of migrating my data quality pipeline from Spark 3.1 to 3.5. Unfortunately, I don't have the means to change my environment, and I need to run my code on Spark 3.5.

Things are broken in PyDeequ 1.4.0, mostly because, since Spark 3.4: "... Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported" (source)

This causes things like the following to break:

[screenshots: error, error_msg]

Any thoughts on it?

@SemyonSinchenko

SemyonSinchenko commented Jul 4, 2024

Just choose single-user access mode in Databricks and it will work. The error you mentioned is only related to the Spark Connect environment (see the Databricks shared access mode limitations).

@MrPowers FYI
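
For anyone hitting this, a small sketch to fail fast with a clear message when the session is a Spark Connect one, where PyDeequ's py4j-based calls cannot work. The helper name is made up for illustration; the only assumption is that accessing sparkContext on a Connect session raises:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def has_jvm_spark_context(session) -> bool:
    """PyDeequ drives Deequ through py4j, so it needs the JVM-backed SparkContext.
    Spark Connect sessions (e.g. Databricks shared access mode) do not expose it."""
    try:
        _ = session.sparkContext  # raises on Spark Connect sessions
        return True
    except Exception:
        return False


if not has_jvm_spark_context(spark):
    raise RuntimeError(
        "PyDeequ needs a classic (JVM-backed) Spark session; "
        "use single-user access mode instead of Spark Connect / shared access mode."
    )
```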

@datanikkthegreek

Hi everyone,

After the update, we are using DBR 14.3 on Databricks with Spark 3.5. When running on job clusters it fails with the error below, while with an interactive cluster everything works fine. We use single-user access mode.

Any ideas? :)

[screenshot of the error]

@SemyonSinchenko

I would first check that the Deequ JAR is actually on the JVM class path, and that the Deequ version is the one required by python-deequ.
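
A hedged way to sanity-check both points from Python, assuming the README-style setup where pydeequ exposes the expected Maven coordinate as pydeequ.deequ_maven_coord; the comparison itself is only illustrative:

```python
import pydeequ
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Deequ Maven coordinate python-deequ expects for the configured SPARK_VERSION.
print("expected by pydeequ :", pydeequ.deequ_maven_coord)

# What was actually requested for the JVM class path ("" if nothing was configured,
# e.g. because the jar was installed as a cluster library instead).
print("spark.jars.packages :", spark.sparkContext.getConf().get("spark.jars.packages", ""))

# If these disagree, or the jar is missing entirely, the first Deequ call typically fails
# with a ClassNotFoundException or NoSuchMethodError surfaced through py4j.
```

On Databricks, installing the same coordinate as a cluster library (or setting spark.jars.packages in the cluster config) is usually the simplest way to keep the jar and the Python package in sync.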
