Fix to_yaml serialization dropping global checks #428
Conversation
Thanks @antonl 🎉 . Heads-up, @cosmicBboy may not be able to review soon so I'm jumping in :) I've added my comments. Besides fixing tests and CI, is there anything else to cover?
I personally think the scope is fine. Those are important fixes but quite small and easier to understand in the context of this issue.
Codecov Report
@@            Coverage Diff             @@
##              dev     #428      +/-   ##
==========================================
- Coverage   99.37%   99.30%   -0.08%
==========================================
  Files          21       21
  Lines        2545     2572      +27
==========================================
+ Hits         2529     2554      +25
- Misses         16       18       +2
Continue to review full report at Codecov.
In these lines, dataframe_checks cannot be None based on the call condition.
@jeffzi I can't seem to see where coverage is missing (I get a 403 error), so I can't fix the problems. Other than that, I think the PR is ready.
Thanks for taking the time to go the extra mile and improve the implementation of registered checks.
I left some questions about the # fixme, otherwise LGTM 👍
Codecov reports 2 untested lines here but I cannot reproduce on my machine. I think @cosmicBboy faced similar issues in the past. I'm re-running the CI 🤞
pandera/checks.py
class _CheckMeta(type):  # pragma: no cover
    """Check metaclass."""

    # FIXME: this should probably just be moved to _CheckBase
    REGISTERED_CUSTOM_CHECKS: Dict[str, Callable] = {}  # noqa
Could you elaborate why it'd be better in _CheckBase? REGISTERED_CUSTOM_CHECKS is referenced by __getattr__, so it seems to make sense to have them sitting next to each other.
This comment got moved by black. Essentially, I think we could merge the _CheckMeta with _CheckBase. The metaclass actually confuses mypy, and I don't see the benefit of using this mixin pattern here. Since this is a style choice, I didn't want to just do it though, so I added a comment so that we could discuss.
If you move __getattr__ to _CheckBase, it's going to act on instances of the class and not on the class itself.
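As a minimal illustration of that point (a hypothetical sketch, not pandera's actual classes): __getattr__ defined on a metaclass is consulted for attribute lookups on the class object itself, whereas __getattr__ defined on the class only fires for instances.

REGISTERED_CUSTOM_CHECKS = {"is_positive": lambda x: x > 0}

class _Meta(type):
    def __getattr__(cls, name):
        # consulted for Check.<name>, i.e. lookups on the class object itself
        try:
            return REGISTERED_CUSTOM_CHECKS[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

class Check(metaclass=_Meta):
    pass

print(Check.is_positive(1))  # True: resolved through the metaclass __getattr__
# If __getattr__ lived on Check instead, only Check().is_positive would work,
# not Check.is_positive.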
Right. Metaclasses confuse me, had to look up the descriptor protocol docs again. 👍
Ok, I'm proposing we just move the metaclass to _CheckBase. It's not usable on anything else anyway because it requires a name property, for example. I'll fix it and remove the fixme comment.
pandera/io.py
def handle_stat_dtype(stat):
    # fixme: change interface to not require a dtype spec
I assume this is fixed with pandas_dtype=None above?
I was worried that my fix is a hack and that there's a better way to do this (I don't like nonlocal). In the component-wise case it's an optimization that prevents a dtype lookup because you assume that each component has the same types for the statistics. Is this a valid assumption? Maybe we can just drop the dtype argument.
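For illustration, a hypothetical sketch (not pandera's actual io.py; the helper names and dtype lookup are made up) contrasting the two closure styles under discussion: mutating an enclosing variable with nonlocal versus passing the resolved value through explicitly.

def serialize_stats_with_nonlocal(stats, dtype=None):
    resolved = dtype

    def handle_stat_dtype(stat):
        nonlocal resolved                   # mutates the enclosing variable
        if resolved is None:
            resolved = type(stat).__name__  # stand-in for a real dtype lookup
        return (stat, resolved)

    return {k: handle_stat_dtype(v) for k, v in stats.items()}


def serialize_stats_without_nonlocal(stats, dtype=None):
    def handle_stat_dtype(stat, dtype):
        # resolve per call and return the result instead of mutating shared state
        dtype = dtype if dtype is not None else type(stat).__name__
        return (stat, dtype)

    return {k: handle_stat_dtype(v, dtype) for k, v in stats.items()}


print(serialize_stats_with_nonlocal({"min_value": 1}))
print(serialize_stats_without_nonlocal({"min_value": 1}))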
I've committed a change that circumvents nonlocal (not a big fan either). I also noticed that the dtype inference could fail when one stat was not a recognized dtype. More details in the general discussion.
An example of failure was:
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "notype_column": pa.Column(
            str,
            checks=pa.Check.isin(["foo", "bar", "x", "xy"]),  # list is not a PandasDtype
        )
    }
)
schema.to_yaml()
Notes for @cosmicBboy:
I'm still getting some sort of permission error for viewing coverage.
I dug into the check dtype issue. As @antonl pointed out, the arguments of a check are assumed to have the same dtype as the column, which is not always the case. For example:

import pandera as pa
import pandas as pd
import datetime

schema = pa.DataFrameSchema(
    columns={
        "pd_timedelta_column": pa.Column(
            pa.Timedelta, checks=[pa.Check.greater_than(pd.Timedelta(1000, unit="ns"))]
        ),  # ok
        "timedelta_column": pa.Column(
            pa.Timedelta, checks=[pa.Check.greater_than(datetime.timedelta(seconds=1))]
        ),  # Argument of check is assumed to be of type pandas.TimeDelta
    }
)
schema.to_yaml()
# AttributeError: 'datetime.timedelta' object has no attribute 'delta'

I think the (de)serialization of the check arguments needs to take their actual dtype into account. However, it cannot be done reliably in the current state of PandasDtype:

import pandera as pa
import pandas as pd

pa.PandasDtype(pd.Timestamp("20100101"))
# ValueError: Timestamp('2010-01-01 00:00:00') is not a valid PandasDtype

pa.PandasDtype(pd.Timedelta(1000))
# ValueError: Timedelta('0 days 00:00:00.000001') is not a valid PandasDtype

We only have specialized serialization for …
looks good @antonl! have a quick clarification (see inline comment) and this should be good to go.
I think we can ignore the
Weird, can you provide a screenshot of what you're seeing? Maybe you need to be part of the pandera-dev org group to see it... would be happy to send you an invite :)
Seems like an issue on codecov's side that fixed itself. I'll wrap up the last minor bits and then we're good to go.
Sure, maybe when I commit some more things! I'm pushing for using pandera internally so I have a stake in seeing it get to 1.0.
@cosmicBboy I think I can't see coverage when @jeffzi makes changes. I can see my own though. @jeffzi I'm happy with this. Let me know if you want me to revert the …
I'm ok with the change. Actually it makes it easier to have specialized …
edit: @cosmicBboy Should this be merged into …
* fix DataFrameSchema comparison with non-DataFrameSchema (#431)
* increase cache version
* fix DataFrameSchema comparison with non-DataFrameSchema
* fix pylint error
Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
* add doc about attributes excluded by SchemaModel (#436)
* fix empty SchemaModel (#434)
* fix empty data type not supported for serialization (#435)
* Fix to_yaml serialization dropping global checks (#428)
* increase cache version
* ci: add dataframe checks tests
* bugfix: allow serialization of dataframe checks to_yaml
* ci: add test to ensure serialization of lambda check fails
* bugfix: ensure checks with no parameters generate appropriate schema
* wip: allow looking up registered checks
* fix: compare checks by name rather than by object equality
* ci: black
* ci: lint
* enh: use REGISTERED_CUSTOM_CHECKS for attribute lookup, add dir method
* enh: add to_yaml method to Schema, add unit test
* ci: disable typechecking on _CheckMeta
* ci: isort
* ci: doctests
* ci: improve coverage
* ci: codecov
* fix unrecognized check dtype during (de)serialization
* fix handle_stat_dtype closures
* enh: move metaclass onto _CheckBase
* ci: add test that ensures to_yaml warns on unregistered checks
* ci: revert adding duplicate test
Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
* remove unnecessary warnings (#441): since parse_checks handles cases of unregistered checks, the parts of the io module that raised a user warning were never in the execution path and therefore never covered in tests
* bugfix: preserve pandas extension types during validation (#443)
* Update docs (#444)
* update docs theme to Furo light
* add furo
* update styles
* add favicon, update style
Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: Anton Loukianov <anton.loukianov@gmail.com>
Fix #419.
In hindsight, perhaps I've opened up a whole can of worms here and this PR should be broken up into smaller pieces. If that's so, let me know.
I've discovered a few issues while working on this that I've also taken the liberty to fix:
* When serializing Checks into statistics, there is an implicit assumption that if the check does not have arguments, then it is not serializable. This used to skip lambda Checks.
* The checks: field is not filled at all, nor are the global checks collected. This is the main issue of this PR.
* A functools.partial function does not have the __code__ field (see the sketch after this list).
* Added a to_yaml helper method to Schema + unit test, as requested in "to_yaml silently drops global checks" #419.
* Custom checks are now only added to REGISTERED_CUSTOM_CHECKS, but there is additional support in the metaclass to look there for custom checks when using getattr lookups and dir.
* to_script fails, instead of having the silent strip behavior.
* from_yaml fails when deserializing unknown checks. This is currently an AttributeError, but should probably be a SchemaInitError or maybe a warning. I'm open to suggestions.
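To illustrate the functools.partial point above, here is a minimal, self-contained sketch (hypothetical names, not pandera's serialization code): partial objects lack a __code__ attribute, so any serialization logic that detects checks by inspecting __code__ will skip them.

import functools

def greater_than(series, min_value):
    return series > min_value

# registered checks are commonly wrapped as partials
registered_check = functools.partial(greater_than, min_value=0)

print(hasattr(greater_than, "__code__"))      # True: plain functions carry a code object
print(hasattr(registered_check, "__code__"))  # False: functools.partial objects do not,
                                              # so serialization keyed on __code__ skips them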