-
-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ibis check backend #1831
base: ibis-dev
Are you sure you want to change the base?
Ibis check backend #1831
Conversation
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
@cosmicBboy I can take a look at this today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in taking a look at this! I think it makes sense overall, but have some general stylistic suggestions, as well as a more important question about whether can rely on order.
@@ -15,6 +20,9 @@ class CheckResult(NamedTuple): | |||
failure_cases: ir.Table | |||
|
|||
|
|||
IbisCheckObjects = Union[ir.Table, ir.Column] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: pandera/engines/ibis_engine.py defines IbisObject = Union[ir.Column, ir.Table]
; should that just import this instead, or could that set of types be different in some situation?
pandera/backends/ibis/checks.py
Outdated
columns = ( | ||
[check_obj.key] if check_obj.key else check_obj.table.columns | ||
) | ||
_fn = self.check_fn | ||
out = check_obj.table.mutate( | ||
**{col: _fn(check_obj.table[col]) for col in columns} | ||
) | ||
out = out.select(columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to import ibis.selectors as s
above, but then you can:
columns = ( | |
[check_obj.key] if check_obj.key else check_obj.table.columns | |
) | |
_fn = self.check_fn | |
out = check_obj.table.mutate( | |
**{col: _fn(check_obj.table[col]) for col in columns} | |
) | |
out = out.select(columns) | |
selector = s.cols(check_obj.key) if check_obj.key is not None else s.all() | |
out = check_obj.table.mutate(s.across(selector, self.check_fn)).select( | |
selector | |
) |
More like the Polars implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! was scouring the ibis API for this but missed the across
selector.
pandera/backends/ibis/checks.py
Outdated
import ibis | ||
import ibis.expr.types as ir | ||
from ibis.expr.types.groupby import GroupedTable | ||
from ibis.expr.datatypes import core as idt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from ibis.expr.datatypes import core as idt | |
import ibis.expr.datatypes as dt |
would be the "standard" Ibis import for this.
pandera/backends/ibis/checks.py
Outdated
else: | ||
out = self.check_fn(check_obj) | ||
|
||
if isinstance(out, (ir.BooleanScalar, ir.BooleanColumn)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(out, (ir.BooleanScalar, ir.BooleanColumn)): | |
if out.type().is_boolean(): |
pandera/backends/ibis/checks.py
Outdated
bool_out = out.mutate(**{CHECK_OUTPUT_KEY: out.columns[0]}) | ||
for col in out.columns[1:]: | ||
bool_out = bool_out.mutate( | ||
**{CHECK_OUTPUT_KEY: bool_out[CHECK_OUTPUT_KEY] & out[col]} | ||
) | ||
bool_out = bool_out.select(CHECK_OUTPUT_KEY) | ||
return bool_out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool_out = out.mutate(**{CHECK_OUTPUT_KEY: out.columns[0]}) | |
for col in out.columns[1:]: | |
bool_out = bool_out.mutate( | |
**{CHECK_OUTPUT_KEY: bool_out[CHECK_OUTPUT_KEY] & out[col]} | |
) | |
bool_out = bool_out.select(CHECK_OUTPUT_KEY) | |
return bool_out | |
return out.select( | |
reduce(lambda x, y: x & out[y], out.columns, ibis.literal(True)).name( | |
CHECK_OUTPUT_KEY | |
) | |
) |
seems simpler (and more like the logic in the Polars backend)
pandera/backends/ibis/checks.py
Outdated
for _col, _dtype in out.schema().items(): | ||
assert isinstance(_dtype, idt.Boolean), ( | ||
f"column {_col} is not boolean. If check function " | ||
"returns a dataframe, it must contain only boolean columns." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? If so, what's the ideal behavior (for my understanding)?
I.e. would you like to (automatically?) cast to boolean if the check doesn't return a boolean, or should we explicitly make sure each of the columns is a boolean?
Regardless, we can ignore the idt
import altogether by checking dtype.is_boolean()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you like to (automatically?) cast to boolean if the check doesn't return a boolean
No, this shouldn't be automatic... if a check function doesn't return a boolean it's a user/developer error.
The expectation of a check function is it should produce a scalar boolean, column boolean, or table boolean, since it's meant to produce a set of truth values about data.
pandera/backends/ibis/checks.py
Outdated
_left = check_obj.table.mutate(_id=ibis.row_number()) | ||
_right = check_output.mutate(_id=ibis.row_number()) | ||
_t = _left.join( | ||
check_output.mutate(_id=ibis.row_number()), | ||
_left._id == _right._id, | ||
how="inner", | ||
).drop("_id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think row order can be guaranteed like this.
I think the way to do this would be to create a set of result columns in apply()
instead of overwriting the data columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you suggest a code change for this? I was racking my brain on how to basically get the failure cases from the original data based on the check output boolean values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah; let me make a PR to your branch, probably easier since it'll span a couple functions. The output of apply()
is not used raw (without postprocessing) anywhere, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cosmicBboy #1855 does this, with one caveat: instead of returning a table (in the case of a check function that takes a table and returns a table), I changed the API to return a dictionary of column names to expressions, because returning a table creates something that cannot be aligned with the original data.
This makes sense to me, but I wanted to see what you think.
(I also made most of the other changes suggested here in that PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, the reason things work is because the tests are using the pandas backend. Now that I think about it, we should test with DuckDB or something else (could even be Polars, I suppose), because Ibis will drop the pandas backend. This will also provide a bit better guardrails around doing stuff that only works in pandas.
) -> CheckResult: | ||
"""Postprocesses the result of applying the check function.""" | ||
check_output = check_output.name(CHECK_OUTPUT_KEY) | ||
failure_cases = check_obj.table.filter(~check_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to comment for postprocess(..., ir.Table)
below, I think check_output
will need to have the original input as well as check results for this to work reliably.
* Apply suggestions from code review Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Fix row-order-dependent order by adding table cols Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> --------- Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
This PR implements the ibis check backend: