Add support for Spark Connect dataframes #1775
Conversation
…tions Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Edit²: Already solved by another PR
```python
def _check_uniqueness(
    self,
    obj: DataFrame,
    schema,
) -> DataFrame:
    """Ensure uniqueness in dataframe columns.

    :param obj: dataframe to check.
    :param schema: schema object.
    :returns: dataframe checked.
    """
```
Removing a function that has no body and no references to it
…mon and connect spark dataframes Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1775      +/-   ##
==========================================
- Coverage   94.28%   93.28%    -1.00%
==========================================
  Files          91      120       +29
  Lines        7013     9133     +2120
==========================================
+ Hits         6612     8520     +1908
- Misses        401      613      +212
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
About these 2 code lines not being covered by tests, we cannot cover them without adding new CI jobs that test both
Hi @cosmicBboy, were you able to take a look at this? I don't know if you are too busy
hey @filipeo2-mck thanks for the fix! would you mind rebasing this on Then you can run
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
Just did it @cosmicBboy, I believe it's ready for a review
```diff
@@ -13,7 +13,6 @@ astroid==2.15.8
 asttokens==2.4.1
     # via stack-data
 asv==0.6.3
-    # via -r /var/folders/wd/sx8dvgys011_mrcsd1_vrz1m0000gn/T/tmp6ejs7w6z
```
I just removed those, as they are not meaningful, ok?
how are you removing these?
if these are being removed manually, this will have no effect when ci/dev dependencies are regenerated
I just removed the entire line using VS Code's regular-expression search (`# via -r /var/folders.*\n` and `# -r /var/folders.*\n`). I thought it was some dirt from my setup, but I just noticed that the `main` branch already contains those.
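For reproducibility, the same cleanup can be scripted rather than done interactively in the editor. A minimal sketch using the two patterns quoted above (the function name and sample data are illustrative, not part of the PR):

```python
import re

# Annotation lines pointing at throwaway pip-compile temp files under
# /var/folders — the two patterns quoted in the comment above.
PATTERNS = [
    re.compile(r"^[ \t]*# via -r /var/folders.*\n", re.MULTILINE),
    re.compile(r"^[ \t]*# -r /var/folders.*\n", re.MULTILINE),
]

def strip_tmp_annotations(text: str) -> str:
    """Drop annotation lines that reference temp files, keep all others."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text

sample = (
    "asv==0.6.3\n"
    "    # via -r /var/folders/wd/T/tmp6ejs7w6z\n"
    "asttokens==2.4.1\n"
    "    # via stack-data\n"
)
print(strip_tmp_annotations(sample))
```

Note that meaningful `# via <package>` annotations (like `# via stack-data`) are left untouched; only the temp-path lines are dropped.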
In the application I work on, these annotations are very useful for understanding which package is currently limiting an indirect dependency's version.
I implemented the suggested changes, just let me know which version you want to keep :)
Signed-off-by: Filipe Oliveira <filipe_oliveira@mckinsey.com>
One minor nit/question, otherwise lgtm!
Hey! A quick question, just out of curiosity: do you know when this fix will be made available as a new patch release?
Description

Fixes #1673 by adding support for the new Spark Connect dataframes, which are available since Spark 3.4.

Considerations

- `DataFrameTypes` was created under `pandera/api/pyspark/types.py` to standardize the type annotations for both types of DFs, across all the files that need this new type annotation in the pyspark backend. I didn't change all type annotations yet; I'll change those later, when the solution and test design is approved.
- In the `pandera/backends/pyspark/checks.py` file, the currently pinned and outdated `multimethod==1.10` package does not understand the new annotated type (I tried some approaches like `Union[]`, `TypeVar()`, etc.), while the more recent versions (`1.11+`) were able to parse it correctly and dispatch the execution to the proper `@overload`ed function.
- Unfortunately, `multimethod==1.10` is the last version that supports Python 3.8, whose end of life is planned for Oct 2024. Upgrading `multimethod` to solve this incompatibility and, by consequence, removing support for Python 3.8 in Pandera is not supposed to happen right now, so I opted for replacing the overloaded functions with plain functions. When Python 3.8 support is dropped and `multimethod` is updated, we can return to the original design if needed.
- Both `SparkSession` types (the original non-connect one and the new connect one) always run: both types are parameterized at the top of the test modules and all test functions inherit it. This design allows future new tests to inherit the same structure.
- With the new `[connect]` extra for `pyspark`, new requirements were generated.
- `types-pkg_resources` was replaced by `types-setuptools`, as the yank message in PyPI suggests.

TODO

- … `pyspark/` namespace.
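The dispatch change described above can be sketched generically. This is a minimal, self-contained illustration with stand-in classes — not pandera's actual types or function names — contrasting the `@multimethod`-overload style (which needs `multimethod>=1.11` to resolve these annotations) with the plain-function replacement that works under the pinned 1.10:

```python
from typing import Union

# Stand-ins for the two DataFrame flavors; the real classes live in
# pyspark.sql and pyspark.sql.connect (names here are illustrative only).
class ClassicDataFrame: ...
class ConnectDataFrame: ...

# Analogous to the DataFrameTypes alias added under pandera/api/pyspark/types.py:
DataFrameTypes = Union[ClassicDataFrame, ConnectDataFrame]

# With multimethod >= 1.11 one could register an @overload per class and let
# the library dispatch on the runtime type. With multimethod==1.10 that fails
# for these annotations, so a plain function routes explicitly instead:
def check_backend(df: DataFrameTypes) -> str:
    """Route to the appropriate handler via an explicit isinstance check."""
    if isinstance(df, ClassicDataFrame):
        return "classic"
    if isinstance(df, ConnectDataFrame):
        return "connect"
    raise TypeError(f"unsupported dataframe type: {type(df).__name__}")

print(check_backend(ClassicDataFrame()))  # -> classic
print(check_backend(ConnectDataFrame()))  # -> connect
```

The explicit-`isinstance` form trades the extensibility of registered overloads for zero dependency on `multimethod`'s union handling, which is why it can be reverted once Python 3.8 support is dropped and `multimethod` is upgraded.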