Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

Open
michaelwalshe opened this issue Dec 2, 2021 · 4 comments · May be fixed by #1844
Open

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

michaelwalshe opened this issue Dec 2, 2021 · 4 comments · May be fixed by #1844
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@michaelwalshe
Copy link

As in the title, if you create a Schema and set 'raise_warning = True' in a check, then this behaviour will not persist if you then save the Schema to a yaml file or even as a Python script. This is on pandera v0.8.0.

Example:

df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6]}
)

schema = pa.schema_inference.infer_schema(df)

schema = schema.update_column('a', checks=[pa.Check.isin([2, 3, 4], raise_warning = True)])

schema.validate(df)

This returns:

 UserWarning: <Schema Column(name=a, type=DataType(int64))> failed element-wise validator 0:
<Check isin: isin({2, 3, 4})>
failure cases:
   index  failure_case
0      0             1
  warnings.warn(error_msg, UserWarning)

However when I save to yaml and re-create the Schema:

schema.to_yaml("./example_schema.yml")

new_schema = pa.DataFrameSchema.from_yaml("./example_schema.yml")
new_schema.validate(df)

This returns (after all the traceback):

SchemaError: <Schema Column(name=a, type=int64)> failed element-wise validator 0:
<Check isin: isin({2, 3, 4})>
failure cases:
   index  failure_case
0      0             1

Expected behavior

I would expect the error/warning behaviour for Checks to persist if saving a Schema to file.

@michaelwalshe michaelwalshe added the bug Something isn't working label Dec 2, 2021
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Dec 3, 2021

thanks for filing this bug @michaelwalshe!

So we want the checks to serialize into something like this:

columns:
  a:
    dtype: int64
    nullable: false
    checks:
      isin:
        allowed_values: [2, 3, 4]
        options:  # reserved keyword for serializing options
          raise_warning: True

The relevant parts of the codebase that would need to be updated are:

  • parse_checks: so that the check_statistics dict contains the user-provided check kwargs
  • format_checks: so that the to_script python serialization of an inferred schema is created correctly
  • _serialize_check_stats: to make sure the serialized data is in the correct format
  • _deserialize_check_stats: to make sure the yaml data is correctly parsed to re-constitute the checks correctly

This is fairly in the weeds of the pandera codebase, but let me know if you'd be open to making a contribute for this bugfix!

I've also added a "help wanted" tag on this issue in case anyone in the community wants to tackle it before I get to it.

@cosmicBboy cosmicBboy added the help wanted Extra attention is needed label Dec 3, 2021
@alexismanuel
Copy link

Is the issue being worked on ? If not I am willing to do it.
@cosmicBboy do you know if the mentioned parts of the codebase to be updated still are relevant ?

@cosmicBboy
Copy link
Collaborator

Hi @alexismanuel, go for it!
Yes, those are still the relevant parts, though some of the code has moved:

@alexismanuel
Copy link

@cosmicBboy

While working on #694 about preserving raise_warning behavior, I noticed a potential inconsistency in how options are handled between parse_checks and parse_check_statistics.

Currently by following your guidance on the functions to update:

  • parse_checks returns check statistics with options (like raise_warning and ignore_na)
  • parse_check_statistics returns checks without these options, only reconstructing the base check

This creates an asymmetry in the roundtrip:

checks = [pa.Check.greater_than(1, raise_warning=True)]
check_statistics = parse_checks(checks)  # includes options
reconstructed_checks = parse_check_statistics(check_statistics)  # loses options

Should parse_check_statistics be updated to handle options and reconstruct checks with their original settings? This would make the serialization/deserialization process more consistent and preserve check behaviors across the roundtrip.

Let me know your thoughts on this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants