-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-33985: [C++] Add substrait serialization/deserialization for expressions #34834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33985: [C++] Add substrait serialization/deserialization for expressions #34834
Conversation
|
|
|
Leaving as draft as I need to add more test cases as well as python bindings. |
c7003a1 to
f2b0f8a
Compare
|
I've added python bindings. Now all that is needed is documentation / examples |
f2b0f8a to
97c4573
Compare
29f9f59 to
44e0023
Compare
|
I'm marking this ready for review. I still want to add a few unit tests that verify we correctly raise errors when given Substrait expressions that Arrow cannot handle (e.g. MultiOrList) but these should be a minor addition. |
|
The appveyor failure seems valid (though utterly baffling): https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47443264 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the nice and comprehensive description!
Did a review of the python bindings, which are looking good, just some minor comments.
python/pyarrow/_substrait.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this useful? (it can give multiple expressions with the same name?) Or could also raise an error instead?
I would maybe rather validate that len(exprs) == len(names)
(and in that case you can also do for expr, name in zip(exprs, names): .. to simplify the code a bit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was on the fence here. I don't have any real use case for it. I agree it feels better to just give an error if they don't give the same number of names. I'll fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this to now raise an error if the length of names and exprs doesn't match (and I use zip now and added a test case).
python/pyarrow/_compute.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return _pas().serialize_expressions([self], "expression", schema, allow_udfs=allow_udfs) | |
| return _pas().serialize_expressions([self], ["expression"], schema, allow_udfs=allow_udfs) |
This currently causes the bug that a deserialized form of this has "e" as name:
In [16]: expr = pc.field("a") == 1
In [17]: buf = expr.to_substrait(pa.schema([('a', 'int32')]))
In [18]: pyarrow.substrait.deserialize_expressions(buf).expressions
Out[18]: {'e': <pyarrow.compute.Expression (FieldPath(0) == 1)>}
(so might be good to add a test for this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. This also gets caught now because exprs and names don't have the same length. I've added a test for this as well.
python/pyarrow/_substrait.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| the Substrait expression ``a_i32 + b_i32`` is different from the | |
| Substrait expression ``a_i64 + b_i64``. Pyarrow expressions are | |
| the Substrait expression ``a:i32 + b:i32`` is different from the | |
| Substrait expression ``a:i64 + b:i64``. Pyarrow expressions are |
? (that might be clearer that the actual field names are still "a" and "b" in both cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed to this.
python/pyarrow/_substrait.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "udf" in the keyword name might be a bit confusing, as I think users of pyarrow will think in the form of actual UDFs defined by them, and not functions defined by arrow (but not part of substrait), as for the user, those are "built-in" functions, not UDFs.
I see that in the C++ code you are using allow_arrow_extensions as keyword. We can use that here as well? (or is there a specific reason you went for a different name?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I forgot I needed to reconcile this :) Both made sense to me.
Setting this to true enables both "arrow builtin functions that are not substrait functions" and "user registered udfs (or actual UDFs)" (from Substrait's perspective these are the same thing). My thinking was that, as Substrait's function support expands, the first case wouldn't be encountered as much. However, I think both names are still ok. I'm happy to switch to "allow_arrow_extensions"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to allow_arrow_extensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have an equals method on the Expression if you want to avoid string repr comparison (but not sure what the corner cases for either option)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned below this doesn't work because of the kernels binding.
python/pyarrow/tests/test_compute.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one I don't fully understand: this are expressions with calls but without any field reference, only with scalars which already have a type. So why is bound/unbound relevant in this case? (I would have expected that only be relevant for references to fields in the schema)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we bind expressions, in addition to looking up field references, we look up matching kernels (and will actually apply implicit casts) and we store the matching kernel as part of the expression. So "1 + 3" goes from:
function: add
args:
1: i64
3: i64
kernel: nullptr
...to...
function: add
args:
1: i64
3: i64
kernel: add<i64,i64>
And unfortunately, this means that Expression::Equals does not compare the two as equal. I've added #36427 to hopefully add that option at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the explanation! I hadn't considered that the function vs kernel distinction
python/pyarrow/tests/test_compute.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nicer to split out the substrait-based serialization to a separate test, but then we need to factor out the expression creation into a helper function? (it's certainly OK to just leave as is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and split them out.
And substrait doesn't have an "is_in" like function? (or are there plans for that?) |
It's an interesting point. We have things like this outside of expressions too. For example, the "join" node doesn't distinguish between an equality join (which can be done efficiently with a hashmap) and a non-equality join (which cannot). In that case we actually have both representations. The one people typically use is the "JoinRel" which is a logical operator and thus allowed to be more generic without concern for efficiency and the other one is the "HashJoinRel" which is more specific / physical, but typically not created by producers (instead planners or optimizers convert from one to the other). I think this is interesting because "is_in" vs. "singular-or-list" is basically a logical vs physical distinction for expressions which I don't think I've really considered before, but I agree with you its valid. In any case, it will be easy enough in Acero's converter, to recognize the cases that can collapse to |
|
I also created substrait-io/substrait#517 on the substrait side in case anyone wants to chime in. |
|
@westonpace @jorisvandenbossche is this ready to merge? |
…ed substrait version to 0.27
…ired up allow_udfs
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…is owned by the extension set and will not go out of scope.
fb3bd15 to
4280c4a
Compare
|
Yes. Sorry, I have rebased one last time and will merge as soon as CI is green. |
|
Failures appear unrelated. |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 702e9ca. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
@github-actions crossbow submit example-python-minimal-build-fedora-conda |
|
Revision: 4280c4a Submitted crossbow builds: ursacomputing/crossbow @ actions-44bfd49cd1
|
…ilter as a Substrait proto extended expression (#35570) ### Rationale for this change To close #34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by #35798 This PR needs/use this PRs/Issues: - #34834 - #34227 - #35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: #34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
|
In case this helps anyone, there's an example here showing how this can be used through Python: https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0 |
… expressions (apache#34834) ### Rationale for this change Substrait provides a library-independent way to represent compute expressions. By serializing and deserializing pyarrow compute expression to substrait we can allow interoperability with other libraries. Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth and so there was no need to work with expressions by themselves. However, as more and more APIs and integration points emerge it turns out there are situations where serializing expressions by themselves is useful. For example, the proposed datasets protocol, or for the Java JNI datasets implementation (which uses Arrow-C++'s datasets) ### What changes are included in this PR? In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message. In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from substrait (these will be encoded as an ExtendedExpression message with one expression named "expression") In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction). This PR adds many (though not all) new bindings. In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list). This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions) which should be a sufficient set of expressions for a first release. ### Are these changes tested? Yes. ### Are there any user-facing changes? There are new features, as described above, but no backwards incompatible changes. ### Caveats There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work. #### Bound Expressions Arrow-C++ has long had a distinction between "unbound expressions" (e.g. `a + b`) and "bound expressions" (e.g. `a:i32 + b:i32`). A bound expression is an expression that has been bound to a schema of some kind. Field references are resolved and the output type is known for every node of the AST. Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions. Substrait is only capable (currently) of representing bound expressions. As a result, in order to serialize expressions, the user will need to provide an input schema. This can be an inconvenience for some workflows. To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515) Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object. It would make testing easier if we had a `pyarrow.compute.Expression.equals` variant that did not examine bound fields. #### Named field references Pyarrow datasets users are used to working with named field references. For example, one can set a filter `pc.equal(ds.field("x"), 7)`. Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema. So the above expression, after round tripping, would become something like `pc.equal(ds.field(3), 7)` (assuming `"x"` is at index `3` in the schema used for serialization). This is something that can be overcome in the future if Substrait adds support for unbound expressions. Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions). #### UDFs UDFs ARE supported by this PR. This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with `register_scalar_function`". By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs). However, you can set the `allow_udfs` flag to True and these will be allowed. The Substrait representation will have the URI `urn:arrow:substrait_simple_extension_function`. **Options**: Although UDFs are allowed we do not yet support UDFs that take function options. These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough). #### Rough Edges There are a few corner cases: * The function `is_in` converts to Substrait's `SingularOrList`. On conversion back to Arrow this becomes an or list. In other words, the function `is_in(5, [1, 2, 5])` converts to `5 == 1 || 5 == 2 || 5 == 5`. This is because Substrait's or list is more expression and allows things like `5 == field_ref(0) || 5 == 7` which cannot be expressed as an `is_in` function. * Arrow functions can either be converted to Substrait or are considered UDFs. However, there are a small number of functions which can "sometimes" be converted to Substrait depending on the function options. At the moment I think this is only the `is_null` function. The `is_null` function has an option `nan_is_null` which will allow you to consider `NaN` as a null value. Substrait has no single function that evaluates both `NULL` and `NaN` as true. In the meantime you can use `is_null || is_nan`. In the future, should someone want to, they could add special logic to convert this case. * Closes: apache#33985 Lead-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
…der::Filter as a Substrait proto extended expression (apache#35570) ### Rationale for this change To close apache#34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798 This PR needs/use this PRs/Issues: - apache#34834 - apache#34227 - apache#35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: apache#34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…der::Filter as a Substrait proto extended expression (apache#35570) ### Rationale for this change To close apache#34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798 This PR needs/use this PRs/Issues: - apache#34834 - apache#34227 - apache#35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: apache#34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…ilter as a Substrait proto extended expression (#35570) ### Rationale for this change To close apache/arrow#34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache/arrow#35798 This PR needs/use this PRs/Issues: - apache/arrow#34834 - apache/arrow#34227 - apache/arrow#35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: #34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
Substrait provides a library-independent way to represent compute expressions. By serializing and deserializing pyarrow compute expression to substrait we can allow interoperability with other libraries.
Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth and so there was no need to work with expressions by themselves.
However, as more and more APIs and integration points emerge it turns out there are situations where serializing expressions by themselves is useful. For example, the proposed datasets protocol, or for the Java JNI datasets implementation (which uses Arrow-C++'s datasets)
What changes are included in this PR?
In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message.
In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from substrait (these will be encoded as an ExtendedExpression message with one expression named "expression")
In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction). This PR adds many (though not all) new bindings.
In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list).
This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions) which should be a sufficient set of expressions for a first release.
Are these changes tested?
Yes.
Are there any user-facing changes?
There are new features, as described above, but no backwards incompatible changes.
Caveats
There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work.
Bound Expressions
Arrow-C++ has long had a distinction between "unbound expressions" (e.g.
a + b) and "bound expressions" (e.g.a:i32 + b:i32). A bound expression is an expression that has been bound to a schema of some kind. Field references are resolved and the output type is known for every node of the AST.Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions. Substrait is only capable (currently) of representing bound expressions. As a result, in order to serialize expressions, the user will need to provide an input schema. This can be an inconvenience for some workflows. To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515)
Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object. It would make testing easier if we had a
pyarrow.compute.Expression.equalsvariant that did not examine bound fields.Named field references
Pyarrow datasets users are used to working with named field references. For example, one can set a filter
pc.equal(ds.field("x"), 7). Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema. So the above expression, after round tripping, would become something likepc.equal(ds.field(3), 7)(assuming"x"is at index3in the schema used for serialization). This is something that can be overcome in the future if Substrait adds support for unbound expressions. Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions).UDFs
UDFs ARE supported by this PR. This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with
register_scalar_function". By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs). However, you can set theallow_udfsflag to True and these will be allowed. The Substrait representation will have the URIurn:arrow:substrait_simple_extension_function.Options: Although UDFs are allowed we do not yet support UDFs that take function options. These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough).
Rough Edges
There are a few corner cases:
is_inconverts to Substrait'sSingularOrList. On conversion back to Arrow this becomes an or list. In other words, the functionis_in(5, [1, 2, 5])converts to5 == 1 || 5 == 2 || 5 == 5. This is because Substrait's or list is more expression and allows things like5 == field_ref(0) || 5 == 7which cannot be expressed as anis_infunction.is_nullfunction. Theis_nullfunction has an optionnan_is_nullwhich will allow you to considerNaNas a null value. Substrait has no single function that evaluates bothNULLandNaNas true. In the meantime you can useis_null || is_nan. In the future, should someone want to, they could add special logic to convert this case.