# Support built-in dataframe checks and dataframe-level options #383
Jumping off of this comment: #382 (comment). To decompose this issue into two problems: should we have keyword validation options in the `Config` class?

I agree with all your points about adding built-in checks as `Config` attributes.
Built-in checks as config attributes is indeed more consistent. We should interpret unknown `Config` attributes as registered checks; at the moment, unknown attributes are ignored. Side note: we should encourage this style if we add more attributes:

```python
import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel):
    class Config(BaseConfig):  # IDE can help auto-complete valid attributes
        gt = 0
        lt = 100
```
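To make the "interpret unknown `Config` attributes as registered checks" idea concrete, here is a minimal, pandera-free sketch; `collect_check_specs` and the `RESERVED` set are hypothetical names for illustration, not pandera's actual implementation:

```python
# Hypothetical sketch: treat unknown Config attributes as check specs.
# RESERVED is an illustrative subset of real Config options, not the full list.
RESERVED = {"name", "coerce", "strict", "ordered"}

def collect_check_specs(config_cls):
    """Return {attribute: value} for non-reserved, non-dunder attributes."""
    return {
        name: value
        for name, value in vars(config_cls).items()
        if not name.startswith("_") and name not in RESERVED
    }

class Config:
    coerce = True   # reserved option: not a check
    gt = 0          # unknown attribute: candidate built-in check
    lt = 100

print(collect_check_specs(Config))  # {'gt': 0, 'lt': 100}
```

Anything not in the reserved set would then be looked up in the check registry, which is why unknown attributes can no longer be silently ignored.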
Are there plans for making global checks serializable? I'd like to use them.
Hey @antonl, I had to double check, but any check registered via `register_check_method` should be serializable. I did find a bug in the current serialization implementation that errors out when providing such checks, though.
Hi @cosmicBboy! Thank you for your quick reply, I really like the library! I've posted a test case to demonstrate what I meant in the discussion you've linked.
@jeffzi @cosmicBboy I'm interested in working on this. How would you support checks with multiple parameters?

```python
import pandera as pa
from pandera.model import BaseConfig

class Schema(pa.SchemaModel):
    class Config(BaseConfig):  # IDE can help auto-complete valid attributes
        # zero_stat_check ?
        one_stat_check = 0
        # two_stat_check = ("a", 2)?  {"first": "a", "second": 2}?
```
thanks @antonl, work on this would be much appreciated! Currently the way these are handled for `Field` is with a dict of keyword arguments:

```python
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})
```

So the dict values are passed to the built-in check as keyword arguments. Another thought I had about this issue: I wonder whether built-in dataframe checks make sense to specify in `Config` at all, or whether they belong in a separate inner class:

```python
class Schema(pa.SchemaModel):
    class Checks(GlobalChecks):  # or something 🤷‍♂️
        eq = 0
        ...
```
Agreed, we could interpret a tuple as positional arguments and a dict as keyword arguments.

Regarding DataFrame-level checks (global checks): I'm reluctant to add another inner class. That's not very pythonic (as far as I know).

@antonl Your help is appreciated indeed 🎉 Any thoughts about the inner class idea?
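The tuple/dict convention discussed above can be sketched as a small normalization helper; `normalize_check_args` is a hypothetical name, not part of pandera:

```python
# Hypothetical sketch of the convention discussed above:
# tuple -> positional args, dict -> keyword args, anything else -> single arg.
def normalize_check_args(value):
    if isinstance(value, dict):
        return (), dict(value)
    if isinstance(value, tuple):
        return tuple(value), {}
    return (value,), {}

print(normalize_check_args(0))                            # ((0,), {})
print(normalize_check_args(("a", 2)))                     # (('a', 2), {})
print(normalize_check_args({"first": "a", "second": 2}))  # ((), {'first': 'a', 'second': 2})
```

The resulting `(args, kwargs)` pair could then be splatted directly into the registered check's signature, which answers the multi-parameter question above without new syntax.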
Regarding the inner class idea, I'm not a fan. Most Python people don't even know you can declare classes like this. How about keeping the same field syntax for cols/index and making a class decorator to enable this sugar:

```python
@with_checks(pa.Check.min_rows(5), ...)
class Schema(pa.SchemaModel):
    col: Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 100})
```

The implementation would just attach the checks to the class. You could even enable something like this:

```python
@with_checks(min_rows={"count": 5})
class Schema(pa.SchemaModel):
    ...
```

That maintains the feel of `Field` checks as desired.
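A minimal, pandera-free sketch of that decorator idea; `with_checks`, `min_rows`, and the `__dataframe_checks__` attribute are all hypothetical names from this proposal, not existing pandera API:

```python
# Hypothetical sketch of the @with_checks class decorator proposed above.
def with_checks(*checks, **named_checks):
    def decorator(cls):
        # Accumulate on an (assumed) class attribute so decorators can stack.
        collected = list(getattr(cls, "__dataframe_checks__", ()))
        collected.extend(checks)
        collected.extend(named_checks.items())  # e.g. ("min_rows", {"count": 5})
        cls.__dataframe_checks__ = tuple(collected)
        return cls
    return decorator

@with_checks(min_rows={"count": 5})
class Schema:
    ...

print(Schema.__dataframe_checks__)  # (('min_rows', {'count': 5}),)
```

A schema's validation step would then read `__dataframe_checks__` alongside the per-column checks declared via `Field`.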
I think for consistency of the API it makes sense to keep these in `Config`.
I do agree, but I figured it was worth considering 🤔
As an aside, my understanding of what is "pythonic" has changed so many times over the years that I don't really know what it means anymore 😅 Anyway, I'm inclined to stick with defining built-in dataframe checks in `Config`.
Let me know if #478 looks acceptable, @cosmicBboy and @jeffzi! It's a little weird that we have to specify stuff on the `Config` class implicitly, but it is easy enough to implement. We should perhaps have a "best practices" section in the docs that highlights the difference between methods decorated with `@dataframe_check` and built-in checks declared in `Config`.
Best practices and cookbook sections would be great.
I kind of agree. An alternative to the inner class could be a special `__config__` attribute:

```python
class Base(pa.SchemaModel):
    a: Series[int]

    __config__ = Config(
        name="Base schema",
        coerce=True,
        ordered=True,
        multiindex_coerce=True,
        multiindex_strict=True,
        multiindex_name="mi",
        custom_check=(1, 2),
    )
```

Just testing the water; I can open a discussion if any of you @antonl @cosmicBboy think something like that is worth investigating.
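A sketch of how such a `Config` object could separate known dataframe-level options from extra check arguments; the class and its `_KNOWN_OPTIONS` set are illustrative assumptions, not pandera's implementation:

```python
# Hypothetical Config object splitting known options from check specs.
class Config:
    # Illustrative subset of dataframe-level options, not pandera's full set.
    _KNOWN_OPTIONS = {
        "name", "coerce", "ordered",
        "multiindex_coerce", "multiindex_strict", "multiindex_name",
    }

    def __init__(self, **kwargs):
        self.options = {k: v for k, v in kwargs.items() if k in self._KNOWN_OPTIONS}
        # Anything else is treated as a (registered) check spec.
        self.checks = {k: v for k, v in kwargs.items() if k not in self._KNOWN_OPTIONS}

cfg = Config(name="Base schema", coerce=True, custom_check=(1, 2))
print(cfg.options)  # {'name': 'Base schema', 'coerce': True}
print(cfg.checks)   # {'custom_check': (1, 2)}
```

Unlike the inner-class form, this keeps `Config` a plain object that can be constructed, shared, and inherited explicitly.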
Sorry I wasn't clear, @jeffzi. Also, the two ways have trade-offs: one is a one-off but prevents serialization, and the other is global.
Fixed by #478.
**Is your feature request related to a problem? Please describe.**

Discussion in #382 has led to the need to support validation checks at the dataframe level. We'll need to discuss further to converge on the ideal API for this, keeping in mind that we need to optimize for (a) a good UX for the object-based and class-based APIs, and (b) support for data synthesis strategies, as all schemas should be able to produce strategies that generate valid data.