Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to_yaml serialization dropping global checks #428

Merged

Conversation

antonl
Copy link
Contributor

@antonl antonl commented Mar 4, 2021

Fix #419 .

In hindsight, perhaps I've opened up a whole can of worms here and this PR should be broken up into smaller pieces. If that's so, let me know.

I've discovered a few issues while working on this that have also taken the liberty to fix.

  • When parsing Checks into statistics, there is an implicit assumption that if the check does not have arguments, then it is not serializable. This used to skip lambda Checks.
  • The dataframe checks: field is not filled at all, nor are the global checks collected. This is the main issue of this PR.
  • Comparing checks/hashing fails when those checks are registered with the extension API. This is because the functools.partial function does not have the __code__ field.
  • Missing to_yaml helper method to Schema + unit test as requested from to_yaml silently drops global checks #419,
  • Confusing implementation of REGISTERED_CUSTOM_CHECKS. Instead, custom checks are only added there, but there is additional support in the metaclass to look there for custom checks when using getattr lookups and dir.
  • Added test to show that to_script fails, instead of having the silent strip behavior.
  • Added test that checks that from_yaml fails when deserializing unknown checks. This is currently an AttributeError, but should probably be a SchemaInitError or maybe a warning. I'm open to suggestions.

pandera/checks.py Outdated Show resolved Hide resolved
pandera/schema_statistics.py Outdated Show resolved Hide resolved
pandera/checks.py Outdated Show resolved Hide resolved
tests/io/test_io.py Outdated Show resolved Hide resolved
@jeffzi
Copy link
Collaborator

jeffzi commented Mar 6, 2021

Thanks @antonl 🎉 . Heads-up, @cosmicBboy may not be able to review soon so I'm jumping in :)

I've added my comments. Besides fixing tests and CI, is there anything else to cover?

In hindsight, perhaps I've opened up a whole can of worms here and this PR should be broken up into smaller pieces.

I personally think the scope is fine. Those are important fixes but quite small and easier to understand in the context of this issue.

@antonl antonl marked this pull request as ready for review March 18, 2021 23:42
@antonl antonl requested a review from jeffzi March 19, 2021 00:23
@antonl antonl changed the title WIP: Bugfix/#419 to yaml drops global checks Fix to_yaml serialization dropping global checks Mar 19, 2021
@codecov
Copy link

codecov bot commented Mar 19, 2021

Codecov Report

Merging #428 (e1ee5d5) into dev (c85ce63) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #428      +/-   ##
==========================================
- Coverage   99.37%   99.30%   -0.08%     
==========================================
  Files          21       21              
  Lines        2545     2572      +27     
==========================================
+ Hits         2529     2554      +25     
- Misses         16       18       +2     
Impacted Files Coverage Δ
pandera/checks.py 98.54% <100.00%> (+0.05%) ⬆️
pandera/extensions.py 98.27% <100.00%> (-0.03%) ⬇️
pandera/io.py 97.40% <100.00%> (-1.21%) ⬇️
pandera/model.py 100.00% <100.00%> (ø)
pandera/schema_statistics.py 100.00% <100.00%> (ø)
pandera/schemas.py 99.31% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c85ce63...e1ee5d5. Read the comment docs.

antonl added 5 commits March 18, 2021 18:13
In these lines, dataframe_checks cannot be None based on the call condition.
@antonl
Copy link
Contributor Author

antonl commented Mar 19, 2021

@jeffzi I can't seem to see where coverage is missing (I get 403 error), so I can't fix the problems. Other than that, I think the PR is ready.

Copy link
Collaborator

@jeffzi jeffzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to go the extra mile and improve the implementation of registered checks.

I left some questions about the # fixme, otherwise LGTM 👍

Codecov reports 2 untested lines here but I cannot reproduce on my machine. I think @cosmicBboy faced similar issues in the past. I'm re-running the CI 🤞

class _CheckMeta(type): # pragma: no cover
"""Check metaclass."""

# FIXME: this should probably just be moved to _CheckBase

REGISTERED_CUSTOM_CHECKS: Dict[str, Callable] = {} # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why it'd be better in CheckBase ? REGISTERED_CUSTOM_CHECKS is referenced by __getattr__, it seems to make sense to have them sitting next to each others.

Copy link
Contributor Author

@antonl antonl Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment got moved by black. Essentially, I think we could merge the _CheckMeta with _CheckBase. The metaclass actually confuses mypy, and I don't see the benefit of using this mixin pattern here. Since this is a style choice, I didn't want to just do it though, so I added a comment so that we could discuss.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move __getattr__ to _CheckBase it's going to act on instances of the class and not on the class itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Metaclasses confuse me, had to look up the descriptor protocol docs again. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm proposing we just move the metaclass to _CheckBase. It's not usable on anything else anyway because it requires a name property for example. I'll fix it and remove the fixme comment.

pandera/io.py Outdated

def handle_stat_dtype(stat):
# fixme: change interface to not require a dtype spec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is fixed with pandas_dtype=None above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried that my fix is a hack and that there's a better way to do this (I don't like nonlocal). In the component-wise case it's an optimization that prevents a dtype lookup because you assume that each component has the same types for the statistics. Is this a valid assumption? Maybe we can just drop the dtype argument.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've committed a change that circumvents nonlocal (not a big fan either). I also noticed that the dtype inference could fail when one stat was not a recognized dtype. More details in the general discussion.

An example of failure was:

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "notype_column": pa.Column(
            str,
            checks=pa.Check.isin(["foo", "bar", "x", "xy"]), # list is not a PandasDtype
        )
    }
)
schema.to_yaml()

pandera/io.py Outdated Show resolved Hide resolved
@jeffzi
Copy link
Collaborator

jeffzi commented Mar 19, 2021

Notes for @cosmicBboy :

@antonl
Copy link
Contributor Author

antonl commented Mar 20, 2021

I'm still getting some sort of permission error for viewing coverage.

@jeffzi
Copy link
Collaborator

jeffzi commented Mar 20, 2021

I digged into the check dtype issue. As @antonl pointed out, pandera.io assumes that each component has the same types for the statistics

For example:

import pandera as pa
import pandas as pd
import datetime

schema = pa.DataFrameSchema(
    columns={
        "pd_timedelta_column": pa.Column(
            pa.Timedelta, checks=[pa.Check.greater_than(pd.Timedelta(1000, unit="ns"))]
        ), # ok
        "timedelta_column": pa.Column(
            pa.Timedelta, checks=[pa.Check.greater_than(datetime.timedelta(seconds=1))]
        ), # Argument of check is assumed to be of type pandas.TimeDelta
    }
)

schema.to_yaml()


schema.to_yaml()
# AttributeError: 'datetime.timedelta' object has no attribute 'delta'

I think the (de)serialization of the Check should not depend of the dtype of the column. Ideally we'd want to infer the dtype of the stats from their values.

However, it cannot be done reliably in the state of PandasDtype:

import pandera as pa
import pandas as pd

pa.PandasDtype(pd.Timestamp("20100101"))
# ValueError: Timestamp('2010-01-01 00:00:00') is not a valid PandasDtype
pa.PandasDtype(pd.Timedelta(1000))
# ValueError: Timedelta('0 days 00:00:00.000001') is not a valid PandasDtype

We only have specialized serialization for Timestamp and Timedelta but more could come in the future. I will open a separate issue, but we should wait for the dtype refactor (#369) to fix this bug.

Copy link
Collaborator

@cosmicBboy cosmicBboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good @antonl! have a quick clarification (see inline comment) and this should be good to go.

pandera/io.py Outdated Show resolved Hide resolved
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Mar 21, 2021

I think we can ignore the codecov/project failure

I'm still getting some sort of permission error for viewing coverage.

Weird, can you provide a screenshot of what you're seeing? Maybe you need to be part of the pandera-dev org group to see it... would be happy to send you an invite :)

@antonl
Copy link
Contributor Author

antonl commented Mar 22, 2021

Weird, can you provide a screenshot of what you're seeing? Maybe you need to be part of the pandera-dev org group to see it

Seems like an issue on codecov's side that fixed itself. I'll wrap up the last minor bits and then we're good to go.

...would be happy to send you an invite :)

Sure, maybe when I commit some more things! I'm pushing for using pandera internally so I have a stake in seeing it get to 1.0.

@antonl
Copy link
Contributor Author

antonl commented Mar 22, 2021

@cosmicBboy I think I can't see coverage when @jeffzi makes changes. I can see my own though.

@jeffzi I'm happy with this. Let me know if you want me to revert the _CheckMeta move to _CheckBase.

@jeffzi
Copy link
Collaborator

jeffzi commented Mar 22, 2021

@jeffzi I'm happy with this. Let me know if you want me to revert the _CheckMeta move to _CheckBase.

I'm ok with the change. Actually it makes it easier to have specialized Check classes if pandera goes that route (suggested in #429).

edit: @cosmicBboy Should this me merged into dev or release/0.8.0? 👀

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Mar 23, 2021

let's merge this to dev. It does add a new method SchemaModel.to_yaml, but let's just go with it being more of a bugfix :)

thanks for your work on this @antonl @jeffzi !

will merge this into master and cut the 0.6.3 release #438

@cosmicBboy cosmicBboy merged commit 32543d4 into unionai-oss:dev Mar 23, 2021
@antonl antonl deleted the bugfix/#419-to-yaml-drops-global-checks branch March 23, 2021 16:32
cosmicBboy added a commit that referenced this pull request Mar 27, 2021
* fix DataFrameSchema comparison with non-DataFrameSchema (#431)

* increase cache version

* fix DataFrameSchema comparison with non-DataFrameSchema

* fix pylint error

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>

* add doc about attributes excluded by SchemaModel (#436)

* fix empty SchemaModel (#434)

* fix empty data type not supported for serialization (#435)

* Fix to_yaml serialization dropping global checks (#428)

* increase cache version

* ci: add dataframe checks tests

* bugfix: allow serialization of dataframe checks to_yaml

* ci: add test to ensure serialization of lambda check fails

* bugfix: ensure checks with no parameters generate appropriate schema

* wip: allow looking up registered checks

* fix: compare checks by name rather than by object equality

* ci: black

* ci: lint

* enh: use REGISTERED_CUSTOM_CHECKS for attribute lookup, add dir method

* enh: add to_yaml method to Schema, add unit test

* ci: disable typechecking on _CheckMeta

* ci: isort

* ci: doctests

* ci: improve coverage

* ci: codecov

In these lines, dataframe_checks cannot be None based on the call condition.

* fix unrecognized check dtype during (de)serialization

* fix handle_stat__dtype closures

* enh: move metaclass onto _CheckBase

* ci: add test that ensures to_yaml warns on unregistered checks

* ci: revert adding duplicate test

Co-authored-by: cosmicBboy <niels.bantilan@gmail.com>
Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>

* remove unnecessary warnings (#441)

since parse_checks handles cases of unregistered checks, the
parts of the io module that raised a user warning were are never
in the execution path and therefore never covered in tests

* bugfix: preserve pandas extension types during validation (#443)

* Update docs (#444)

* update docs theme to Furo light

* add furo

* update styles

* add favicon, update style

Co-authored-by: Jean-Francois Zinque <jzinque@gmail.com>
Co-authored-by: Anton Loukianov <anton.loukianov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants