raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

michaelwalshe · 2021-12-02T15:14:06Z

As in the title, if you create a Schema and set 'raise_warning = True' in a check, then this behaviour will not persist if you then save the Schema to a yaml file or even as a Python script. This is on pandera v0.8.0.

Example:

df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6]}
)

schema = pa.schema_inference.infer_schema(df)

schema = schema.update_column('a', checks=[pa.Check.isin([2, 3, 4], raise_warning = True)])

schema.validate(df)

This returns:

 UserWarning: <Schema Column(name=a, type=DataType(int64))> failed element-wise validator 0:
<Check isin: isin({2, 3, 4})>
failure cases:
   index  failure_case
0      0             1
  warnings.warn(error_msg, UserWarning)

However when I save to yaml and re-create the Schema:

schema.to_yaml("./example_schema.yml")

new_schema = pa.DataFrameSchema.from_yaml("./example_schema.yml")
new_schema.validate(df)

This returns (after all the traceback):

SchemaError: <Schema Column(name=a, type=int64)> failed element-wise validator 0:
<Check isin: isin({2, 3, 4})>
failure cases:
   index  failure_case
0      0             1

Expected behavior

I would expect the error/warning behaviour for Checks to persist if saving a Schema to file.

The text was updated successfully, but these errors were encountered:

cosmicBboy · 2021-12-03T15:58:54Z

thanks for filing this bug @michaelwalshe!

So we want the checks to serialize into something like this:

columns:
  a:
    dtype: int64
    nullable: false
    checks:
      isin:
        allowed_values: [2, 3, 4]
        options:  # reserved keyword for serializing options
          raise_warning: True

The relevant parts of the codebase that would need to be updated are:

parse_checks: so that the check_statistics dict contains the user-provided check kwargs
format_checks: so that the to_script python serialization of an inferred schema is created correctly
_serialize_check_stats: to make sure the serialized data is in the correct format
_deserialize_check_stats: to make sure the yaml data is correctly parsed to re-constitute the checks correctly

This is fairly in the weeds of the pandera codebase, but let me know if you'd be open to making a contribute for this bugfix!

I've also added a "help wanted" tag on this issue in case anyone in the community wants to tackle it before I get to it.

alexismanuel · 2024-10-29T08:09:43Z

Is the issue being worked on ? If not I am willing to do it.
@cosmicBboy do you know if the mentioned parts of the codebase to be updated still are relevant ?

cosmicBboy · 2024-10-29T13:53:16Z

Hi @alexismanuel, go for it!
Yes, those are still the relevant parts, though some of the code has moved:

format checks: https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py#L449C6-L449C19
serialize check stats: https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py#L54
deserialize check stats: https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py#L180

alexismanuel · 2024-11-02T19:28:46Z

@cosmicBboy

While working on #694 about preserving raise_warning behavior, I noticed a potential inconsistency in how options are handled between parse_checks and parse_check_statistics.

Currently by following your guidance on the functions to update:

parse_checks returns check statistics with options (like raise_warning and ignore_na)
parse_check_statistics returns checks without these options, only reconstructing the base check

This creates an asymmetry in the roundtrip:

checks = [pa.Check.greater_than(1, raise_warning=True)]
check_statistics = parse_checks(checks)  # includes options
reconstructed_checks = parse_check_statistics(check_statistics)  # loses options

Should parse_check_statistics be updated to handle options and reconstruct checks with their original settings? This would make the serialization/deserialization process more consistent and preserve check behaviors across the roundtrip.

Let me know your thoughts on this approach.

michaelwalshe added the bug Something isn't working label Dec 2, 2021

cosmicBboy added the help wanted Extra attention is needed label Dec 3, 2021

alexismanuel linked a pull request Nov 2, 2024 that will close this issue

fix: preserve Check options in schema statistics roundtrip #1844

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

michaelwalshe commented Dec 2, 2021

cosmicBboy commented Dec 3, 2021 •

edited

Loading

alexismanuel commented Oct 29, 2024

cosmicBboy commented Oct 29, 2024

alexismanuel commented Nov 2, 2024

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

raise_warning option in a Check is not kept when saving the Schema to yaml or script. #694

Comments

michaelwalshe commented Dec 2, 2021

Example:

Expected behavior

cosmicBboy commented Dec 3, 2021 • edited Loading

alexismanuel commented Oct 29, 2024

cosmicBboy commented Oct 29, 2024

alexismanuel commented Nov 2, 2024

cosmicBboy commented Dec 3, 2021 •

edited

Loading