Add support for dropping invalid rows for pyspark backend #1639
- Add full table validation support for pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
…alidation Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
def equal_to(
    data: PysparkDataframeColumnObject,
    value: Any,
    should_validate_full_table: bool,
Instead of passing this in as an argument, you can use pandera.config.get_config_context to get the full_table_validation configuration value. This keeps the API for each check consistent across the different backends.
Thanks @cosmicBboy. Will make the recommended change.
Also, can you suggest a way to keep the PANDERA_FULL_TABLE_VALIDATION config value False when the backend is pyspark and True when the backend is pandas? I did not find a good way to do this, hence asking for a suggestion 😅.
You can use the config_context context manager in the validate methods for each backend to control this behavior: https://github.com/unionai-oss/pandera/blob/main/pandera/config.py#L71
For example, this is used in the polars backend:
pandera/pandera/api/polars/container.py, line 53 in c24dda9:
with config_context(validation_depth=get_validation_depth(check_obj)):
Thanks @nk4456542, this is awesome! Looks like some of the tests are broken, see https://github.com/unionai-oss/pandera/actions/runs/9054025532/job/24909236981?pr=1639. You can run these tests locally with
- Remove unused decorators Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- Will help to use the flag in backend validate functions Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
- More tests to come for full_table_validation config for built_in_checks after adding support in pyspark backend Signed-off-by: Zaheer Abbas <contact@zaheerabbas.dev>
Was a bit busy for the past two weeks, will continue working on this from this week.
Hi @nk4456542, friendly ping on progress here, let me know if you need any help!
@cosmicBboy - Apologies for dropping this, will pick this up this week. I work at a startup 😅, so I had my work cut out for one of the feature launches. I will contact you in the comments if I need help on this PR.
Thanks for the update @nk4456542, totally understand what it's like to be at a startup 👍
I have been caught up in work again 😞. But I would really like to work on this 😬, and will update here when I can pick it up again. Apologies again for not being clear on the timelines.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #1639       +/-   ##
===========================================
- Coverage   94.28%   74.00%   -20.28%
===========================================
  Files          91      120       +29
  Lines        7013     9190     +2177
===========================================
+ Hits         6612     6801      +189
- Misses        401     2389     +1988
☔ View full report in Codecov by Sentry.
Hi @zaheerabbas21 @cosmicBboy! Do you need any help moving this feature forward?
Solves issue #1540
Tasks to be completed as per this comment:
- PANDERA_FULL_TABLE_VALIDATION configuration. By default, it should be None and should be set depending on the validation backend. It should be True for the pandas check backend but False for the pyspark backend.
  - PANDERA_FULL_TABLE_VALIDATION=False is the current behavior
  - PANDERA_FULL_TABLE_VALIDATION=True should return a boolean column indicating which element in the column passed the check.
- drop_invalid_rows option
- PANDERA_FULL_TABLE_VALIDATION config and drop_invalid_rows option

PS: New to the repo 😄, so please call out if I am not following repo guidelines or code style. Appreciate your help!
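
As a rough illustration of how the tasks above fit together, the sketch below shows a row-level check producing one boolean per row and a drop step that keeps only the passing rows. It uses plain Python dictionaries in place of a pyspark DataFrame, and `row_level_check` and `drop_invalid_rows` are hypothetical helper names chosen for this example, not functions from pandera or this PR.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def row_level_check(
    rows: List[Row], column: str, predicate: Callable[[Any], bool]
) -> List[bool]:
    """Return one boolean per row: did the value in `column` pass the check?"""
    return [predicate(row[column]) for row in rows]


def drop_invalid_rows(rows: List[Row], mask: List[bool]) -> List[Row]:
    """Keep only the rows whose mask entry is True."""
    return [row for row, passed in zip(rows, mask) if passed]


# Illustrative data only.
rows = [
    {"id": 1, "price": 10},
    {"id": 2, "price": -5},
    {"id": 3, "price": 7},
]
mask = row_level_check(rows, "price", lambda value: value > 0)
valid_rows = drop_invalid_rows(rows, mask)
```

The drop step only works when the check reports per-row results, which is why the drop_invalid_rows option depends on full table validation being enabled.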