Support built-in dataframe checks and dataframe-level options #383

cosmicBboy · 2021-01-12T22:35:10Z

Is your feature request related to a problem? Please describe.

Discussion in #382 has lead to the need to support validation checks at the dataframe level. We'll need to discuss further to converge on the ideal API for this, keeping in mind that we need to optimize for (a) a good UX for the object-based and class-based API and (b) support data synthesis strategies, as all schemas should be able to produce data synthesis strategies that produce valid data.

cosmicBboy · 2021-01-12T23:03:12Z

Jumping off of this comment: #382 (comment)

To decompose this issue into 2 problems:

Should we have keyword validation options in the `DataFrameSchema` constructor?

One issue with 2. is that it could bloat the api, especially SchemaModel. Afterwards we could imagine other checks that could be done dataframe-wide, where do we draw the line?

I think the slope isn't so slippery here. On a case-by-case basis we can determine whether a validation check is so fundamental that it should be a keyword option.

For example, the pandas_dtype and nullable options in SeriesSchemaBase could be implemented as a Check, however it is intertwined with how to implement the validation routine when it comes to type-coercion. The built-in Checks, on the other hand, are self-contained and don't really interact with any other part of the validation process besides checking the contents of the dataframe/series.

Similarly, the min number of rows in a dataframe is (i) a common use case, (ii) determines whether the validated dataframe can be empty (min_rows = 0 means it can, which ties nicely into #332), and (iii) interacts with the data synthesis strategy because the size argument would need to be at least min_rows. I think this is enough to justify min_rows as a keyword option in the DataFrameSchema constructor.

How to best-support built-in checks in the `SchemaModel` API?

That said, we should also support built-in checks at the dataframe-level. I believe all of the built-in Checks currently implemented support both a dataframe and column/series input (will need to refactor the types, e.g. https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L460, which are mis-leading). The new extensions module introduced the concept of supported_types for checks, which should probably be folded into the Check class.

pa.DataFrameSchema(..., checks = pa.Check.NotEmpty())

class Schema(pa.SchemaModel):
    class Config:
        checks = [pa.Check.NotEmpty()] # new

I think this solution works, my concern here is that there would then be a discrepency between how built-in checks are expressed in Fields and how they're expressed at the dataframe-level.

Field(gt=100)

# vs

Field(checks=[pa.Check.gt(100)])

I wonder if we could follow this by doing something like:

class Schema(pa.SchemaModel):
    class Config:
        # support built-in checks as config attributes
        gt = 0
        lt = 100

jeffzi · 2021-01-12T23:32:08Z

Should we have keyword validation options in the DataFrameSchema constructor?

I agree with all your points about adding min_rows 👍

How to best-support built-in checks in the SchemaModel API?

Built-in checks as config attributes is indeed more consistent. We should interpret Config unknown attributes as registered checks. At the moment, unknown attributes are ignored.

Side note, we should encourage this style if we add more attributes:

import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel): 
    class Config(BaseConfig): # IDE can help auto-complete valid attributes
        gt = 0
        lt = 100

antonl · 2021-02-16T18:27:42Z

Are there plans for making global checks serializable? I'd like to use pandera as a data definition on disk of sorts. Right now, global checks are just dropped when calling to_yaml. It would be nice to save them as keywords and look up from the extension API, but as far as I know, there is no way to do this.

cosmicBboy · 2021-02-16T19:24:12Z

Hey @antonl, I had to double check, but any check registered via extensions.register_check_method should be serializable as a yaml file or python script, see this post I just made: #417.

I did find a bug in the current serialization implementation that errors out when providing pandas_dtype arguments that are not of type PandasDtype, so for now you'll have to use e.g. pa.Int, pa.Float, pa.String, etc. instead of python built-in types, pandas type string aliases, or numpy types. Will make an issue and fix for this.

antonl · 2021-02-16T20:46:12Z

Hi @cosmicBboy! Thank you for your quick reply, I really like the library! I've posted a test case to demonstrate what I meant in the discussion you've linked.

antonl · 2021-03-22T21:35:28Z

@jeffzi @cosmicBboy I'm interested in working on this.

How would you support checks with multiple parameters?

import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel): 
    class Config(BaseConfig): # IDE can help auto-complete valid attributes
        # zero_stat_check ?
        one_stat_check = 0
        # two_stat_check = ("a", 2)? {"first": "a", "second": 2}?

cosmicBboy · 2021-03-23T01:22:54Z

thanks @antonl, work on this would be much appreciated!

Currently the way these are handled for Field checks is with a dict.

class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})

So the {"first": "a", "second": 2} makes sense, although the tuple idea also makes sense and potentially more concise in some cases, maybe that syntax should be supported too (thoughts @jeffzi?)

Another thought I had about this issue is I wonder whether built-in dataframe checks make sense to specify in Config. This may be over-engineering it, but semantically it would make more sense to put in a different class, where Config should be reserved for configuration options.

class Schema(pa.SchemaModel):
    class Checks(GlobalChecks):  # or something 🤷‍♂️
        eq = 0
        ....

jeffzi · 2021-03-23T22:02:31Z

...the tuple idea also makes sense and potentially more concise in some cases,

Agreed, we could interpret a tuple as positional arguments and dict as keyword arguments.

Regarding DataFrame-level checks (global checks). I'm reluctant to add another inner class. That's not very pythonic (as far as I know). Config was inspired by pydantic to increase adoption, and I couldn't find a better alternative.

Config mirrors DataFrameSchema arguments. Adding built-in checks would be the exception but it's a similar concept to built-in checks in Field so it should be intuitive.

@antonl Your help is appreciated indeed 🎉 Any thoughts about the inner class idea?

antonl · 2021-03-24T03:14:40Z

Regarding the inner class idea, I'm not a fan. Most python people don't even know you can declare classes like this.

How about keeping the same field syntax for cols/index and making a class decorator to enable this sugar,

@with_checks(pa.Check.min_rows(5), ...)
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})

The implementation would just attach a _checks/checks field to the type. Subclassing would just append to the field if it exists.

You could even enable something like this,

@with_checks(min_rows={"count": 5})
class Schema(pa.SchemaModel):
   ...

That maintains the feel of Field checks as desired.

cosmicBboy · 2021-03-24T13:14:07Z

I think for consistency of the SchemaModel interface, I'd like to keep two requirements:

the user doesn't have to worry about Check objects because checks are either built-in or expressed as methods decorated with @check or @dataframe_check.
minimize the surface area of how checks are expressed. Currently its with Field(<builtin_check>=...), @check methods, and registering custom checks with the extensions API and then using them as built-ins in Field.

Regarding DataFrame-level checks (global checks). I'm reluctant to add another inner class.

Regarding the inner class idea, I'm not a fan

I do agree, but I figured it was worth considering 🤔

That's not very pythonic (as far as I know).

As an aside, my understanding of what is "pythonic" has changed so many times over the years I don't really even know what it means anymore 😅

Anyway, I'm inclined to stick with defining built-in dataframe checks in Config with the same behavior as Field: built-in and registered checks can be specified as attributes, where custom checks are expressed as @dataframe_check methods.

antonl · 2021-05-01T04:25:39Z

Let me know if #478 looks acceptable, @cosmicBboy and @jeffzi! It's a little weird that we have to specify stuff on the Config class implicitly, but it is easy enough to implement.

We should perhaps have a "best practices" section in the doc that highlights the difference between methods decorated with dataframe_check and through Config. Do we recommend one or the other?

jeffzi · 2021-05-02T15:49:44Z

Best practices and cookbook sections would be great.

It's a little weird that we have to specify stuff on the Config class implicitly,

I kind of agree but Field can also implicitly swallow custom checks. Perhaps it feels less natural with the inner class Config because this mechanism is uncommon.

An alternative to the inner class could be a special attribute __config__ that must be assigned a Config instance.

class Base(pa.SchemaModel):
    a: Series[int]
    __config__ = Config(
        name="Base schema",
        coerce=True,
        ordered=True,
        multiindex_coerce=True,
        multiindex_strict=True,
        multiindex_name="mi",
        custom_check=(1, 2)
    )

Just testing the water, I can open a discussion if any of you @antonl @cosmicBboy think something like that is worth investigating.

We should perhaps have a "best practices" section in the doc that highlights the difference between methods decorated with dataframe_check and through Config. Do we recommend one or the other?

Imho, @check and @dataframe_check decorators are for defining one-offs, similarly to a Check defined with a lambda function in the DataFrameSchema api, or at least checks specific to a given model. Registered checks are for reusable checks.

antonl · 2021-05-03T01:23:37Z

Sorry I wasn't clear @jeffzi . I think the Config class is fine. I was just concerned that there are two ways to add dataframe checks (using methods and adding keys to Config), and they look totally different.

Also, the two ways have trade-offs: one is a one-off, but prevents serialization, and the other is global.

cosmicBboy · 2021-05-08T15:29:02Z

fixed by #478

cosmicBboy added enhancement New feature or request discussion needed labels Jan 12, 2021

cosmicBboy mentioned this issue Jan 12, 2021

Check dataframe size? #382

Closed

cosmicBboy modified the milestones: 0.7.0 Release, 0.8.0 release Feb 16, 2021

cosmicBboy mentioned this issue Feb 16, 2021

schema serialization doesn't work for non PandasDtype types. #418

Closed

3 tasks

antonl mentioned this issue Feb 16, 2021

to_yaml silently drops global checks #419

Closed

1 task

cosmicBboy added help wanted Extra attention is needed and removed discussion needed labels Feb 19, 2021

antonl mentioned this issue May 1, 2021

Allow attaching registered dataframe checks by using Config field names #478

Merged

cosmicBboy removed this from the 0.8.0 release milestone May 3, 2021

cosmicBboy closed this as completed May 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support built-in dataframe checks and dataframe-level options #383

Support built-in dataframe checks and dataframe-level options #383

cosmicBboy commented Jan 12, 2021

cosmicBboy commented Jan 12, 2021

jeffzi commented Jan 12, 2021 •

edited

Loading

antonl commented Feb 16, 2021

cosmicBboy commented Feb 16, 2021

antonl commented Feb 16, 2021

antonl commented Mar 22, 2021

cosmicBboy commented Mar 23, 2021 •

edited

Loading

jeffzi commented Mar 23, 2021

antonl commented Mar 24, 2021

cosmicBboy commented Mar 24, 2021

antonl commented May 1, 2021

jeffzi commented May 2, 2021 •

edited

Loading

antonl commented May 3, 2021

cosmicBboy commented May 8, 2021

Support built-in dataframe checks and dataframe-level options #383

Support built-in dataframe checks and dataframe-level options #383

Comments

cosmicBboy commented Jan 12, 2021

cosmicBboy commented Jan 12, 2021

Should we have keyword validation options in the DataFrameSchema constructor?

How to best-support built-in checks in the SchemaModel API?

jeffzi commented Jan 12, 2021 • edited Loading

antonl commented Feb 16, 2021

cosmicBboy commented Feb 16, 2021

antonl commented Feb 16, 2021

antonl commented Mar 22, 2021

cosmicBboy commented Mar 23, 2021 • edited Loading

jeffzi commented Mar 23, 2021

antonl commented Mar 24, 2021

cosmicBboy commented Mar 24, 2021

antonl commented May 1, 2021

jeffzi commented May 2, 2021 • edited Loading

antonl commented May 3, 2021

cosmicBboy commented May 8, 2021

Should we have keyword validation options in the `DataFrameSchema` constructor?

How to best-support built-in checks in the `SchemaModel` API?

jeffzi commented Jan 12, 2021 •

edited

Loading

cosmicBboy commented Mar 23, 2021 •

edited

Loading

jeffzi commented May 2, 2021 •

edited

Loading