Ibis check backend #1831

cosmicBboy · 2024-10-15T02:48:06Z

This PR implements the ibis check backend:

Custom checks
Element-wise checks (python UDF)
Support scalar boolean, column boolean, and table boolean outputs

Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>

deepyaman · 2024-10-28T15:57:59Z

@cosmicBboy I can take a look at this today!

deepyaman

Sorry for the delay in taking a look at this! I think it makes sense overall, but I have some general stylistic suggestions, as well as a more important question about whether we can rely on row order.

deepyaman · 2024-11-12T20:35:15Z

pandera/api/ibis/types.py

@@ -15,6 +20,9 @@ class CheckResult(NamedTuple):
    failure_cases: ir.Table


+IbisCheckObjects = Union[ir.Table, ir.Column]


Nit: pandera/engines/ibis_engine.py defines IbisObject = Union[ir.Column, ir.Table]; should that just import this instead, or could that set of types be different in some situation?

deepyaman · 2024-11-12T20:45:26Z

pandera/backends/ibis/checks.py

+            columns = (
+                [check_obj.key] if check_obj.key else check_obj.table.columns
+            )
+            _fn = self.check_fn
+            out = check_obj.table.mutate(
+                **{col: _fn(check_obj.table[col]) for col in columns}
+            )
+            out = out.select(columns)


Will need to import ibis.selectors as s above, but then you can:

Suggested change

-            columns = (
-                [check_obj.key] if check_obj.key else check_obj.table.columns
-            )
-            _fn = self.check_fn
-            out = check_obj.table.mutate(
-                **{col: _fn(check_obj.table[col]) for col in columns}
-            )
-            out = out.select(columns)
+            selector = s.cols(check_obj.key) if check_obj.key is not None else s.all()
+            out = check_obj.table.mutate(s.across(selector, self.check_fn)).select(
+                selector
+            )

More like the Polars implementation.

thanks! was scouring the ibis API for this but missed the across selector.

deepyaman · 2024-11-12T20:57:37Z

pandera/backends/ibis/checks.py

+import ibis
+import ibis.expr.types as ir
+from ibis.expr.types.groupby import GroupedTable
+from ibis.expr.datatypes import core as idt


Suggested change

-from ibis.expr.datatypes import core as idt
+import ibis.expr.datatypes as dt

would be the "standard" Ibis import for this.

deepyaman · 2024-11-12T21:11:17Z

pandera/backends/ibis/checks.py

+        else:
+            out = self.check_fn(check_obj)
+
+        if isinstance(out, (ir.BooleanScalar, ir.BooleanColumn)):


Suggested change

-        if isinstance(out, (ir.BooleanScalar, ir.BooleanColumn)):
+        if out.type().is_boolean():

deepyaman · 2024-11-12T21:34:42Z

pandera/backends/ibis/checks.py

+            bool_out = out.mutate(**{CHECK_OUTPUT_KEY: out.columns[0]})
+            for col in out.columns[1:]:
+                bool_out = bool_out.mutate(
+                    **{CHECK_OUTPUT_KEY: bool_out[CHECK_OUTPUT_KEY] & out[col]}
+                )
+            bool_out = bool_out.select(CHECK_OUTPUT_KEY)
+            return bool_out


Suggested change

-            bool_out = out.mutate(**{CHECK_OUTPUT_KEY: out.columns[0]})
-            for col in out.columns[1:]:
-                bool_out = bool_out.mutate(
-                    **{CHECK_OUTPUT_KEY: bool_out[CHECK_OUTPUT_KEY] & out[col]}
-                )
-            bool_out = bool_out.select(CHECK_OUTPUT_KEY)
-            return bool_out
+            return out.select(
+                reduce(lambda x, y: x & out[y], out.columns, ibis.literal(True)).name(
+                    CHECK_OUTPUT_KEY
+                )
+            )

seems simpler (and more like the logic in the Polars backend)
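
For readers less familiar with the fold in the suggestion above, here is a minimal backend-agnostic sketch of the same idea in plain Python. It is not the pandera or Ibis code: columns are modeled as lists of booleans, `combine_boolean_columns` is a hypothetical helper name, and an all-`True` list stands in for `ibis.literal(True)` as the identity of the AND reduction.

```python
from functools import reduce

CHECK_OUTPUT_KEY = "check_output"  # stand-in for the constant used in the PR


def combine_boolean_columns(columns: dict) -> list:
    """AND every column together row by row, starting from an all-True column."""
    n_rows = len(next(iter(columns.values()), []))
    identity = [True] * n_rows  # plays the role of ibis.literal(True)
    return reduce(
        # Fold one more column into the accumulated result.
        lambda acc, name: [a and b for a, b in zip(acc, columns[name])],
        columns,  # iterating a dict yields its column names
        identity,
    )


out = {"a": [True, True, False], "b": [True, False, False]}
result = {CHECK_OUTPUT_KEY: combine_boolean_columns(out)}
```

Starting from an identity value means the first column needs no special case, which is what removes the `out.columns[0]` / `out.columns[1:]` split in the original loop.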

deepyaman · 2024-11-12T21:44:08Z

pandera/backends/ibis/checks.py

+            for _col, _dtype in out.schema().items():
+                assert isinstance(_dtype, idt.Boolean), (
+                    f"column {_col} is not boolean. If check function "
+                    "returns a dataframe, it must contain only boolean columns."
+                )


Is this necessary? If so, what's the ideal behavior (for my understanding)?

I.e. would you like to (automatically?) cast to boolean if the check doesn't return a boolean, or should we explicitly make sure each of the columns is a boolean?

Regardless, we can ignore the idt import altogether by checking dtype.is_boolean().

> would you like to (automatically?) cast to boolean if the check doesn't return a boolean

No, this shouldn't be automatic... if a check function doesn't return a boolean it's a user/developer error.

The expectation of a check function is that it should produce a scalar boolean, column boolean, or table boolean, since it's meant to produce a set of truth values about data.
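
The "validate, don't cast" expectation described above could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the backend code: a scalar is a `bool`, a column is a list, a table is a dict of lists, and `assert_boolean_output` is a hypothetical name.

```python
def _is_boolean_column(values) -> bool:
    """True only if every value is an actual bool (1 and 0 do not count)."""
    return all(isinstance(v, bool) for v in values)


def assert_boolean_output(output) -> None:
    """Raise TypeError for non-boolean check output instead of casting it."""
    if isinstance(output, bool):  # scalar boolean
        return
    if isinstance(output, list):  # column boolean
        if not _is_boolean_column(output):
            raise TypeError("check output column is not boolean")
        return
    if isinstance(output, dict):  # table boolean: every column must be boolean
        for name, values in output.items():
            if not _is_boolean_column(values):
                raise TypeError(
                    f"column {name} is not boolean. If check function "
                    "returns a dataframe, it must contain only boolean columns."
                )
        return
    raise TypeError(f"unsupported check output type: {type(output).__name__}")
```

Failing loudly keeps a truthy-but-wrong result, such as a column of integers, from being silently treated as a passing check.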

deepyaman · 2024-11-12T21:57:25Z

pandera/backends/ibis/checks.py

+        _left = check_obj.table.mutate(_id=ibis.row_number())
+        _right = check_output.mutate(_id=ibis.row_number())
+        _t = _left.join(
+            check_output.mutate(_id=ibis.row_number()),
+            _left._id == _right._id,
+            how="inner",
+        ).drop("_id")


I don't think row order can be guaranteed like this.

I think the way to do this would be to create a set of result columns in apply() instead of overwriting the data columns.

Can you suggest a code change for this? I was racking my brain on how to basically get the failure cases from the original data based on the check output boolean values.

Yeah; let me make a PR to your branch, probably easier since it'll span a couple functions. The output of apply() is not used raw (without postprocessing) anywhere, right?

@cosmicBboy #1855 does this, with one caveat: instead of returning a table (in the case of a check function that takes a table and returns a table), I changed the API to return a dictionary of column names to expressions, because returning a table creates something that cannot be aligned with the original data.

This makes sense to me, but I wanted to see what you think.

(I also made most of the other changes suggested here in that PR.)

By the way, the reason things work is that the tests are using the pandas backend. Now that I think about it, we should test with DuckDB or something else (could even be Polars, I suppose), because Ibis will drop the pandas backend. This will also provide somewhat better guardrails against doing things that only work in pandas.
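
The alignment concern in this thread can be illustrated with a small plain-Python sketch. It is only an analogy for the approach discussed (keeping the check result on the same row as the data rather than re-joining by row number); `apply_check` and `failure_cases` are hypothetical helpers, and rows are modeled as dicts.

```python
CHECK_OUTPUT_KEY = "check_output"  # stand-in for the constant used in the PR


def apply_check(rows, check_fn):
    """Attach the check result to each row instead of returning it separately."""
    return [{**row, CHECK_OUTPUT_KEY: check_fn(row)} for row in rows]


def failure_cases(checked_rows):
    """Return the original data for rows whose check result is False."""
    return [
        {k: v for k, v in row.items() if k != CHECK_OUTPUT_KEY}
        for row in checked_rows
        if not row[CHECK_OUTPUT_KEY]
    ]


rows = [{"x": 3}, {"x": -1}, {"x": 5}]
checked = apply_check(rows, lambda row: row["x"] > 0)

# Mimic a backend that does not preserve row order between evaluations.
reordered = list(reversed(checked))
failures = failure_cases(reordered)
```

Because each result travels with its row, the failure cases come out the same whatever order the rows are returned in, which a join on a generated row number cannot promise on backends without a stable ordering.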

deepyaman · 2024-11-12T22:00:06Z

pandera/backends/ibis/checks.py

+    ) -> CheckResult:
+        """Postprocesses the result of applying the check function."""
+        check_output = check_output.name(CHECK_OUTPUT_KEY)
+        failure_cases = check_obj.table.filter(~check_output)


Similar to comment for postprocess(..., ir.Table) below, I think check_output will need to have the original input as well as check results for this to work reliably.

* Apply suggestions from code review

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Fix row-order-dependent order by adding table cols

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

cosmicBboy added 3 commits October 9, 2024 23:23

[wip] add minimal ibis check backend implementation

4d35817

Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>

support scalar, column, and table check output types

b509e5b

Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>

support scalar, column, and table check output types

9cb21bf

Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>

cosmicBboy requested a review from deepyaman October 15, 2024 20:15

deepyaman requested changes Nov 12, 2024

deepyaman mentioned this pull request Nov 14, 2024

Ibis check backend suggestions #1855

Merged
