Add support for Spark Connect dataframes #1775

Merged
merged 6 commits into unionai-oss:main on Aug 15, 2024

Conversation


@filipeo2-mck filipeo2-mck commented Aug 2, 2024

Description

Fixes #1673 by adding support for the new Spark Connect dataframes, which have been available since Spark 3.4.

Considerations

  • A new type DataFrameTypes was created under pandera/api/pyspark/types.py to standardize the type annotations for both kinds of dataframes across all the files in the pyspark backend that need this annotation (see the first sketch after this list). I didn't change all type annotations yet; I'll update the remaining ones once the solution and test design are approved.
  • In the pandera/backends/pyspark/checks.py file, the currently pinned and outdated multimethod==1.10 package does not understand the new annotated type (I tried approaches like Union[], TypeVar(), etc.), while the more recent versions (1.11+) parse it correctly and dispatch execution to the proper @overloaded function.
    Unfortunately, multimethod==1.10 is the last version that supports Python 3.8, whose end of life is scheduled for October 2024. Upgrading multimethod to fix this incompatibility, and dropping Python 3.8 support in Pandera as a consequence, is out of scope right now, so I opted to replace the overloaded functions with plain functions (see the second sketch after this list). When Python 3.8 support is dropped and multimethod is upgraded, we can return to the original design if needed.
  • Unit tests were added so that both SparkSession types (the original non-connect one and the new connect one) always run, by parameterizing both types at the top of the test modules so that every test function inherits them (see the third sketch after this list). This design ensures that future tests inherit the same structure.
  • With the addition of the [connect] extra for pyspark, new requirements files were generated.
  • types-pkg_resources was replaced by types-setuptools, as the yank message on PyPI suggests.
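
To make the first consideration concrete, here is a minimal sketch of what such a union alias can look like; it is an illustration under stated assumptions, not necessarily pandera's actual implementation:

# Hypothetical sketch of a DataFrameTypes-style alias. The import path
# pyspark.sql.connect.dataframe only exists on pyspark >= 3.4 with the
# [connect] extra installed, hence the fallback.
from typing import Union
from pyspark.sql import DataFrame

try:
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
    DataFrameTypes = Union[DataFrame, ConnectDataFrame]
except ImportError:
    # pyspark < 3.4, or the connect dependencies are missing
    DataFrameTypes = DataFrame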
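
The second consideration (replacing multimethod @overloads with plain functions) can be pictured as below; this is a minimal invented example, assuming both dataframe flavors expose the same filter/count API, and is not the PR's actual code:

# Instead of two implementations dispatched by multimethod on the input
# type (which the pinned multimethod==1.10 cannot resolve for the new
# union annotation), a single plain function serves both flavors:
def column_values_above(df, column: str, threshold) -> bool:
    # classic and connect DataFrames share this part of the API,
    # so no isinstance() branching is needed here
    return df.filter(df[column] <= threshold).count() == 0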
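
And for the third consideration, a hedged sketch of parameterizing tests over both session types; the fixture name and the connect address are illustrative assumptions, not the PR's code:

import pytest
from pyspark.sql import SparkSession

# Parameterize once at the top of the module; every test that requests
# the fixture then runs against both session types.
@pytest.fixture(params=["classic", "connect"], scope="module")
def spark_session(request):
    if request.param == "classic":
        return SparkSession.builder.master("local[1]").getOrCreate()
    # requires pyspark[connect] and a reachable Spark Connect server
    return SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

def test_dataframe_roundtrip(spark_session):
    df = spark_session.createDataFrame([(1, "a")], ["x", "y"])
    assert df.count() == 1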

TODO

  • Change type annotations under pyspark/ namespace.
  • Increase coverage rates, if necessary.

…tions

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
@filipeo2-mck filipeo2-mck marked this pull request as draft August 2, 2024 18:50

filipeo2-mck commented Aug 2, 2024

Hey @cosmicBboy, it looks like there are problems with types-pkg_resources: I was able to install pre-commit, but pre-commit fails to install the mypy hooks with this message (both on my local computer and in the GH Actions output):
[screenshot of the failing mypy hook installation]

Edit:
types-pkg_resources has been yanked from PyPI; replacing it with types-setuptools solves the issue:
[screenshot of the hooks installing successfully]

Edit²: Already solved by another PR

Comment on lines -556 to -567
def _check_uniqueness(
self,
obj: DataFrame,
schema,
) -> DataFrame:
"""Ensure uniqueness in dataframe columns.

:param obj: dataframe to check.
:param schema: schema object.
:returns: dataframe checked.
"""


@filipeo2-mck filipeo2-mck Aug 2, 2024

Removing a function that has no body and no references to it.

…mon and connect spark dataframes

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

codecov bot commented Aug 2, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.28%. Comparing base (812b2a8) to head (8cd644e).
Report is 136 commits behind head on main.

Files                          Patch %    Lines
pandera/api/pyspark/types.py   80.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1775      +/-   ##
==========================================
- Coverage   94.28%   93.28%   -1.00%     
==========================================
  Files          91      120      +29     
  Lines        7013     9133    +2120     
==========================================
+ Hits         6612     8520    +1908     
- Misses        401      613     +212     


Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

filipeo2-mck commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.29%. Comparing base (812b2a8) to head (231ac97).
Report is 134 commits behind head on main.

Files                          Patch %    Lines
pandera/api/pyspark/types.py   80.00%     2 Missing ⚠️
Additional details and impacted files

Regarding the 2 code lines not covered by tests: we cannot cover them without adding new CI jobs that test both pyspark version ranges, <3.4 and >=3.4. The current CI always gets the latest available version, which falls into the >=3.4 branch.
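
To make the uncovered branch concrete, the guard is of roughly this shape (a hypothetical sketch, not the actual lines in pandera/api/pyspark/types.py):

import pyspark
from packaging.version import Version

if Version(pyspark.__version__) >= Version("3.4"):
    # CI installs the latest pyspark, so only this branch is exercised
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
else:
    # pyspark < 3.4 has no Spark Connect; these lines never run in CI,
    # which is why patch coverage reports them as missing
    ConnectDataFrame = None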

@filipeo2-mck filipeo2-mck marked this pull request as ready for review August 5, 2024 17:41
@filipeo2-mck
Contributor Author

Hi @cosmicBboy, were you able to take a look at this? I don't know if you're too busy.

@cosmicBboy
Collaborator

hey @filipeo2-mck, thanks for the fix!

would you mind rebasing this on main? I just merged #1779, which fixes the setuptools issue.

Then you can run make nox-requirements as described here to resolve the merge conflicts.

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
@filipeo2-mck
Contributor Author

Just did it, @cosmicBboy. I believe it's ready for review.

@@ -13,7 +13,6 @@ astroid==2.15.8
asttokens==2.4.1
# via stack-data
asv==0.6.3
# via -r /var/folders/wd/sx8dvgys011_mrcsd1_vrz1m0000gn/T/tmp6ejs7w6z
Contributor Author


I just removed those, as they are not meaningful. OK?

Collaborator

how are you removing these?

Collaborator

if these are being removed manually, this will have no effect when ci/dev dependencies are regenerated


@filipeo2-mck filipeo2-mck Aug 12, 2024

I just replaced the entire lines using VS Code's regular-expression search (# via -r /var/folders.*\n and # -r /var/folders.*\n). I thought it was some leftover from my setup, but I just noticed that the main branch already contains those.

Collaborator

a better solution for this would be to update the uv pip compile command here and here with the --no-annotate flag

Contributor Author

In the application I work on, these annotations are very useful for understanding which package is currently constraining an indirect dependency's version.
I implemented the suggested changes, just let me know which version you want to keep :)

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

@cosmicBboy cosmicBboy left a comment

One minor nit/question, otherwise lgtm!

pandera/api/pyspark/types.py (review thread resolved)
@cosmicBboy cosmicBboy merged commit d04bb3a into unionai-oss:main Aug 15, 2024
145 of 146 checks passed
@filipeo2-mck
Copy link
Contributor Author

Hey! A quick question, just out of curiosity: do you know when this fix will be made available as a new patch release?
Thanks in advance :)

Development

Successfully merging this pull request may close these issues.

BackendNotFoundError on databricks/pyspark cluster