-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17351: [C++][Compute] Implement a parser for Expressions #14287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
|
a8bfc6e to
6055c3f
Compare
|
If this is meant to be a public API, then I'd expect a design discussion on the ML. |
|
Good point, I'll start a discussion. |
ce351c9 to
f8a1dac
Compare
|
@pitrou @westonpace I've updated the parser to reflect the discussion on the mailing list. The language now looks like the traditional function call syntax instead of the lisp-style syntax. I've also gotten rid of the |
1f84e92 to
10b05fc
Compare
d2dcad3 to
0cfb4b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the parser looks simple now, but I would like to ask if you had considered using a parser generator instead (for future maintainability)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that parser generators are more maintainable. They require learning a new syntax, adding an extra build step, adding an extra dependency to the project, and require debugging generated code. In my experience writing parsers, it's always fewer lines of code and simpler to use to hand-write the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits but this seems straightforward enough.
Some overall thoughts / questions:
Would this support field names or string literals with non-ASCII characters? I think it is ok to force the control characters to be culture-invariant ASCII (e.g. like JSON does) but I don't think we can restrict field names or literals.
Actually, given the lack of control characters surrounding literals, would this support string literals like $string:my string literal with, and )?
The error messages could be improved to include the exact index of the error, and then some truncated preview before and after the error position. For example:
Error at index 17: expected a close parentheses on a function call...
...pper(x + $stri...
^
Even if we don't want to go the parser generator route, do we want to include a grammar for others? Having a grammar could be useful for integration into other languages / implementations or for building auto-complete utilities in interactive expression editors.
0cfb4b9 to
bc98911
Compare
|
Added better error handling and support for escaping stuff with "". |
|
LGTM! For the next PRs, would be great to invest a bit to improve even more the error handling: to provide more context to the user in the case of invalid expressions. |
|
Hey there all, I'm very interested in getting this PR merged ASAP. Is there any remaining work on this to get it merged? |
|
@noahfournier Sorry. The project is lacking review bandwidth at the moment, so we have to prioritize work and this might unfortunately take some time. |
bc98911 to
b887512
Compare
|
I'd like to revive this as it has been an ask for some time and I think it is important. The technical issues of how the parser is created are probably more minor than the maintenance issue of making sure we come up with an expression syntax we are willing to support and expect to last. There was a ML discussion on this but I feel it stalled out somewhat. Part of the challenge is that there were two alternatives proposed. Another challenge is that it would be unfortunate to adopt one standard in Arrow only to have Substrait adopt a different standard later. I propose the following:
|
| FieldRef -> Field | Field FieldRef | ||
| Field -> . Name | [ Number ] | ||
| Literal -> $ TypeName : Value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing rules for escaping right? I think those need to be part of the grammar.
|
There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list? |
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
while the substrait design might be the correct direction to go, but I feel that's a much broader scope compared with this. this PR could bring the preliminaries of filtering into arrow, so some user requests could be fulfilled. and when the substrait integration is mature and complete, this can be switched at that point. all in all, folks, @ianmcook @amol- @westonpace any chance this gets revived and get merged? |
I've seen a few people on the mailing list asking for something like this and I've wanted it myself, so I went ahead and implemented a parser for a lisp-like way of generating Expressions. Calls are of the form
(<fn> <args>), scalars are of the form$type:value, and field refs are of the form!<dot path>.