Add support for Spark Connect dataframes #1775

Merged
merged 6 commits into unionai-oss:main on Aug 15, 2024

Conversation


@filipeo2-mck filipeo2-mck commented Aug 2, 2024

Description

Fixes #1673 by adding support for the new Spark Connect dataframes, which have been available since Spark 3.4.

Considerations

  • A new type DataFrameTypes was created under pandera/api/pyspark/types.py to standardize the type annotations for both kinds of dataframes across all the files in the pyspark backend that need this annotation (see the first sketch after this list). I didn't change all type annotations yet; I'll update the remaining ones once the solution and test design are approved.
  • In the pandera/backends/pyspark/checks.py file, the currently pinned and outdated multimethod==1.10 package does not understand the new annotated type (I tried approaches like Union[], TypeVar(), etc.), while the more recent versions (1.11+) parse it correctly and dispatch execution to the proper @overloaded function.
    Unfortunately, multimethod==1.10 is the last version that supports Python 3.8, whose end of life is scheduled for October 2024. Upgrading multimethod to fix this incompatibility, and dropping Python 3.8 support in Pandera as a consequence, is out of scope right now, so I opted to replace the overloaded functions with plain functions (see the second sketch after this list). When Python 3.8 support is dropped and multimethod is upgraded, we can return to the original design if needed.
  • Unit tests were added so that both SparkSession types (the original non-connect one and the new connect one) always run, by parameterizing both types at the top of the test modules so that every test function inherits them (see the third sketch after this list). This design ensures that future tests inherit the same structure.
  • With the addition of the [connect] extra for pyspark, new requirements files were generated.
  • types-pkg_resources was replaced by types-setuptools, as the yank message on PyPI suggests.
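
To make the first consideration concrete, here is a minimal sketch of what such a union alias can look like; it is an illustration under stated assumptions, not necessarily pandera's actual implementation:

# Hypothetical sketch of a DataFrameTypes-style alias. The import path
# pyspark.sql.connect.dataframe only exists on pyspark >= 3.4 with the
# [connect] extra installed, hence the fallback.
from typing import Union
from pyspark.sql import DataFrame

try:
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
    DataFrameTypes = Union[DataFrame, ConnectDataFrame]
except ImportError:
    # pyspark < 3.4, or the connect dependencies are missing
    DataFrameTypes = DataFrame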
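
The second consideration (replacing multimethod @overloads with plain functions) can be pictured as below; this is a minimal invented example, assuming both dataframe flavors expose the same filter/count API, and is not the PR's actual code:

# Instead of two implementations dispatched by multimethod on the input
# type (which the pinned multimethod==1.10 cannot resolve for the new
# union annotation), a single plain function serves both flavors:
def column_values_above(df, column: str, threshold) -> bool:
    # classic and connect DataFrames share this part of the API,
    # so no isinstance() branching is needed here
    return df.filter(df[column] <= threshold).count() == 0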
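
And for the third consideration, a hedged sketch of parameterizing tests over both session types; the fixture name and the connect address are illustrative assumptions, not the PR's code:

import pytest
from pyspark.sql import SparkSession

# Parameterize once at the top of the module; every test that requests
# the fixture then runs against both session types.
@pytest.fixture(params=["classic", "connect"], scope="module")
def spark_session(request):
    if request.param == "classic":
        return SparkSession.builder.master("local[1]").getOrCreate()
    # requires pyspark[connect] and a reachable Spark Connect server
    return SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

def test_dataframe_roundtrip(spark_session):
    df = spark_session.createDataFrame([(1, "a")], ["x", "y"])
    assert df.count() == 1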

TODO

  • Change type annotations under pyspark/ namespace.
  • Increase coverage rates, if necessary.

…tions

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
@filipeo2-mck filipeo2-mck marked this pull request as draft August 2, 2024 18:50

filipeo2-mck commented Aug 2, 2024

Hey @cosmicBboy, it looks like there are problems with types-pkg_resources: I was able to install pre-commit, but pre-commit fails to install the mypy hooks with this message (both on my local computer and in the GH Actions output):
[screenshot of the failing mypy hook installation]

Edit:
types-pkg_resources has been yanked from PyPI; replacing it with types-setuptools solves the issue:
[screenshot of the hooks installing successfully]

Edit²: Already solved by another PR

Comment on lines -556 to -567
def _check_uniqueness(
self,
obj: DataFrame,
schema,
) -> DataFrame:
"""Ensure uniqueness in dataframe columns.

:param obj: dataframe to check.
:param schema: schema object.
:returns: dataframe checked.
"""


@filipeo2-mck filipeo2-mck Aug 2, 2024

Removing a function that has no body and no references to it.

…mon and connect spark dataframes

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

codecov bot commented Aug 2, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.28%. Comparing base (812b2a8) to head (8cd644e).
Report is 136 commits behind head on main.

Files                          Patch %    Lines
pandera/api/pyspark/types.py   80.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1775      +/-   ##
==========================================
- Coverage   94.28%   93.28%   -1.00%     
==========================================
  Files          91      120      +29     
  Lines        7013     9133    +2120     
==========================================
+ Hits         6612     8520    +1908     
- Misses        401      613     +212     


Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

filipeo2-mck commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.29%. Comparing base (812b2a8) to head (231ac97).
Report is 134 commits behind head on main.

Files                          Patch %    Lines
pandera/api/pyspark/types.py   80.00%     2 Missing ⚠️
Additional details and impacted files

Regarding the 2 code lines not covered by tests: we cannot cover them without adding new CI jobs that test both pyspark version ranges, <3.4 and >=3.4. The current CI always gets the latest available version, which falls into the >=3.4 branch.
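
To make the uncovered branch concrete, the guard is of roughly this shape (a hypothetical sketch, not the actual lines in pandera/api/pyspark/types.py):

import pyspark
from packaging.version import Version

if Version(pyspark.__version__) >= Version("3.4"):
    # CI installs the latest pyspark, so only this branch is exercised
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
else:
    # pyspark < 3.4 has no Spark Connect; these lines never run in CI,
    # which is why patch coverage reports them as missing
    ConnectDataFrame = None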

@filipeo2-mck filipeo2-mck marked this pull request as ready for review August 5, 2024 17:41
@filipeo2-mck
Contributor Author

Hi @cosmicBboy, were you able to take a look at this? I don't know if you're too busy.

@cosmicBboy
Collaborator

hey @filipeo2-mck, thanks for the fix!

would you mind rebasing this on main? I just merged #1779, which fixes the setuptools issue.

Then you can run make nox-requirements as described here to resolve the merge conflicts.

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
@filipeo2-mck
Contributor Author

Just did it, @cosmicBboy. I believe it's ready for review.

@@ -13,7 +13,6 @@ astroid==2.15.8
asttokens==2.4.1
# via stack-data
asv==0.6.3
# via -r /var/folders/wd/sx8dvgys011_mrcsd1_vrz1m0000gn/T/tmp6ejs7w6z
Contributor Author


I just removed those, as they are not meaningful. OK?

Collaborator

how are you removing these?

Collaborator

if these are being removed manually, this will have no effect when ci/dev dependencies are regenerated


@filipeo2-mck filipeo2-mck Aug 12, 2024

I just replaced the entire lines using VS Code's regular-expression search (# via -r /var/folders.*\n and # -r /var/folders.*\n). I thought it was some leftover from my setup, but I just noticed that the main branch already contains those.

Collaborator

a better solution for this would be to update the uv pip compile command here and here with the --no-annotate flag

Contributor Author

In the application I work on, these annotations are very useful for understanding which package is currently constraining an indirect dependency's version.
I implemented the suggested changes, just let me know which version you want to keep :)

Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>

@cosmicBboy cosmicBboy left a comment

One minor nit/question, otherwise lgtm!

pandera/api/pyspark/types.py (review thread resolved)
@cosmicBboy cosmicBboy merged commit d04bb3a into unionai-oss:main Aug 15, 2024
145 of 146 checks passed
@filipeo2-mck
Copy link
Contributor Author

Hey! A quick question, just out of curiosity: do you know when this fix will be made available as a new patch release?
Thanks in advance :)

Development

Successfully merging this pull request may close these issues.

BackendNotFoundError on databricks/pyspark cluster