@save-buffer
Contributor

I've seen a few people on the mailing list asking for something like this and I've wanted it myself, so I went ahead and implemented a parser for a lisp-like way of generating Expressions. Calls are of the form (<fn> <args>), scalars are of the form $type:value, and field refs are of the form !<dot path>.

@github-actions

github-actions bot commented Oct 1, 2022

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@save-buffer save-buffer changed the title Implement a parser for Expressions ARROW-17906: [C++][Compute] Implement a parser for Expressions Oct 1, 2022
@github-actions

github-actions bot commented Oct 1, 2022

⚠️ Ticket has no components in JIRA, make sure you assign one.

@save-buffer save-buffer force-pushed the sasha_parser branch 2 times, most recently from a8bfc6e to 6055c3f Compare October 3, 2022 19:20
@westonpace westonpace self-requested a review October 4, 2022 08:57
@pitrou
Member

pitrou commented Oct 5, 2022

If this is meant to be a public API, then I'd expect a design discussion on the ML.

@save-buffer
Contributor Author

Good point, I'll start a discussion.

@save-buffer save-buffer changed the title ARROW-17906: [C++][Compute] Implement a parser for Expressions ARROW-17351: [C++][Compute] Implement a parser for Expressions Oct 10, 2022
@save-buffer
Contributor Author

@pitrou @westonpace I've updated the parser to reflect the discussion on the mailing list. The language now uses traditional function-call syntax instead of the Lisp-style syntax. I've also dropped the ! prefix on FieldRefs (a field is now written as either .name or [idx]). The syntax for literals is unchanged:

add(.a, $int32:1)
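
For readers following along, here is a minimal sketch of how the call-style syntax described above could be parsed by hand. It is written in Python for brevity and is not the PR's C++ implementation; the names `Call`, `Field`, `Literal`, `ParseError`, and `parse_expression` are hypothetical, and unquoted literal values in this sketch simply end at `,`, `)`, or whitespace.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    args: list


@dataclass
class Field:
    path: list  # field names (str) and positional indices (int)


@dataclass
class Literal:
    type_name: str
    value: str  # kept as raw text; no type-specific conversion here


class ParseError(ValueError):
    pass


def _skip_ws(src, pos):
    while pos < len(src) and src[pos].isspace():
        pos += 1
    return pos


def _read_name(src, pos):
    start = pos
    while pos < len(src) and (src[pos].isalnum() or src[pos] == "_"):
        pos += 1
    if pos == start:
        raise ParseError(f"Error at index {pos}: expected a name")
    return src[start:pos], pos


def _parse_literal(src, pos):
    # Literal: $ TypeName : Value
    type_name, pos = _read_name(src, pos + 1)
    if pos >= len(src) or src[pos] != ":":
        raise ParseError(f"Error at index {pos}: expected ':' after type name")
    start = pos = pos + 1
    while pos < len(src) and src[pos] not in ",)" and not src[pos].isspace():
        pos += 1
    if pos == start:
        raise ParseError(f"Error at index {pos}: expected a literal value")
    return Literal(type_name, src[start:pos]), pos


def _parse_field(src, pos):
    # FieldRef: one or more of ".name" or "[idx]"
    path = []
    while pos < len(src) and src[pos] in ".[":
        if src[pos] == ".":
            name, pos = _read_name(src, pos + 1)
            path.append(name)
        else:
            end = src.find("]", pos)
            digits = src[pos + 1:end] if end != -1 else ""
            if not (digits.isascii() and digits.isdigit()):
                raise ParseError(f"Error at index {pos}: expected [<number>]")
            path.append(int(digits))
            pos = end + 1
    return Field(path), pos


def _parse_call(src, pos):
    name, pos = _read_name(src, pos)
    pos = _skip_ws(src, pos)
    if pos >= len(src) or src[pos] != "(":
        raise ParseError(f"Error at index {pos}: expected '(' after function name")
    pos = _skip_ws(src, pos + 1)
    args = []
    if pos < len(src) and src[pos] == ")":
        return Call(name, args), pos + 1
    while True:
        arg, pos = _parse_expr(src, pos)
        args.append(arg)
        pos = _skip_ws(src, pos)
        if pos < len(src) and src[pos] == ",":
            pos = _skip_ws(src, pos + 1)
        elif pos < len(src) and src[pos] == ")":
            return Call(name, args), pos + 1
        else:
            raise ParseError(f"Error at index {pos}: expected ',' or ')' in call")


def _parse_expr(src, pos):
    if pos >= len(src):
        raise ParseError(f"Error at index {pos}: unexpected end of input")
    if src[pos] == "$":
        return _parse_literal(src, pos)
    if src[pos] in ".[":
        return _parse_field(src, pos)
    return _parse_call(src, pos)


def parse_expression(src):
    expr, pos = _parse_expr(src, _skip_ws(src, 0))
    pos = _skip_ws(src, pos)
    if pos != len(src):
        raise ParseError(f"Error at index {pos}: unexpected trailing input")
    return expr
```

With these definitions, `parse_expression("add(.a, $int32:1)")` yields a `Call` named `add` whose arguments are a `Field` with path `["a"]` and a `Literal` of type `int32` with raw value `"1"`.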

@save-buffer save-buffer force-pushed the sasha_parser branch 2 times, most recently from d2dcad3 to 0cfb4b9 Compare October 27, 2022 23:21
Contributor

Perhaps the parser looks simple now, but I would like to ask if you had considered using a parser generator instead (for future maintainability)?

Contributor Author

I don't think that parser generators are more maintainable. They require learning a new syntax, adding an extra build step, adding an extra dependency to the project, and debugging generated code. In my experience writing parsers, a hand-written parser always ends up being fewer lines of code and simpler to work with.

Member

@westonpace westonpace left a comment

A few nits but this seems straightforward enough.

Some overall thoughts / questions:

Would this support field names or string literals with non-ASCII characters? I think it is ok to force the control characters to be culture-invariant ASCII (e.g. like JSON does) but I don't think we can restrict field names or literals.

Actually, given that nothing delimits a literal's value, would this support string literals that themselves contain , or ), like $string:my string literal with, and )?

The error messages could be improved to include the exact index of the error, and then some truncated preview before and after the error position. For example:

Error at index 17: expected a closing parenthesis on a function call...
...pper(x + $stri...
          ^
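
A rough sketch of how such a message could be assembled, assuming a hypothetical helper named `format_parse_error` and a configurable amount of context on each side (neither comes from the PR). It is written in Python for brevity and assumes the expression is a single line of text:

```python
def format_parse_error(src, index, message, context=7):
    """Build a three-line error: message, truncated preview, caret marker."""
    # Clamp the preview window to the bounds of the source string.
    start = max(0, index - context)
    end = min(len(src), index + context + 1)
    # Ellipses signal that the preview was truncated on that side.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(src) else ""
    preview = prefix + src[start:end] + suffix
    # The caret is offset by the prefix so it lines up with src[index].
    caret = " " * (len(prefix) + index - start) + "^"
    return f"Error at index {index}: {message}\n{preview}\n{caret}"
```

The caret offset accounts for the leading ellipsis, so the marker stays under the offending character whether or not the preview was truncated on the left.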

Even if we don't want to go the parser generator route, do we want to include a grammar for others? Having a grammar could be useful for integration into other languages / implementations or for building auto-complete utilities in interactive expression editors.

@save-buffer
Contributor Author

Added better error handling and support for escaping stuff with "".

@save-buffer save-buffer requested review from pitrou and westonpace and removed request for pitrou and westonpace November 23, 2022 02:50
@aucahuasi
Contributor

LGTM! For follow-up PRs, it would be great to invest a bit more in error handling, so that users get more context when an expression is invalid.
For instance, for this invalid expression:
add($duration(MILLI):10, $duration(MILLIa):20)
we get this error:
'_error_or_value82.status()' failed with Invalid: Error at index 43: Unterminated data type arg list!
It would be nice to get a more precise error message here.
By the way, I built this on an M1 MacBook Pro (Clang compiler) without issues.

@noahfrn
Contributor

noahfrn commented Dec 15, 2022

Hey there all, I'm very interested in getting this PR merged ASAP. Is there any remaining work on this to get it merged?

@pitrou
Member

pitrou commented Dec 15, 2022

@noahfournier Sorry. The project is lacking review bandwidth at the moment, so we have to prioritize work and this might unfortunately take some time.

@westonpace
Member

I'd like to revive this as it has been an ask for some time and I think it is important. The technical issues of how the parser is created are probably more minor than the maintenance issue of making sure we come up with an expression syntax we are willing to support and expect to last.

There was a ML discussion on this but I feel it stalled out somewhat. Part of the challenge is that there were two alternatives proposed. Another challenge is that it would be unfortunate to adopt one standard in Arrow only to have Substrait adopt a different standard later. I propose the following:

  • Build up a corpus of example expressions (10-20 or so) that demonstrate the various features (different types of scalars, escaping strings, etc.)
  • Create a grammar for all proposals (I believe this will help when communicating)
  • Send a message to the Substrait mailing list with the proposal
  • Revive the Arrow ML discussion and point any interested parties to the Substrait discussion
  • Once the Substrait discussion reaches consensus we can merge a parser into arrow-c++

FieldRef -> Field | Field FieldRef
Field -> . Name | [ Number ]
Literal -> $ TypeName : Value
Member

This is missing rules for escaping, right? I think those need to be part of the grammar.
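
To make the point concrete, here is one possible shape for an escaping rule, sketched in Python. It assumes double-quoted values with backslash escapes for `"` and `\`, roughly `Quoted -> " ( Char | \" | \\ )* "`. The thread only states that escaping uses `""`, so the backslash rule and the helper name `read_quoted` are assumptions for illustration, not the PR's actual behavior.

```python
def read_quoted(src, pos):
    """Read a double-quoted value starting at src[pos].

    Returns (unescaped_value, index_just_past_the_closing_quote).
    Assumed escapes: \\" for a quote and \\\\ for a backslash.
    """
    if pos >= len(src) or src[pos] != '"':
        raise ValueError(f"Error at index {pos}: expected an opening quote")
    out = []
    i = pos + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            # Only the two assumed escapes are accepted.
            if i + 1 >= len(src) or src[i + 1] not in '"\\':
                raise ValueError(f"Error at index {i}: invalid escape sequence")
            out.append(src[i + 1])
            i += 2
        elif ch == '"':
            return "".join(out), i + 1
        else:
            # Any other character, including ',' and ')' and non-ASCII text,
            # is taken literally inside the quotes.
            out.append(ch)
            i += 1
    raise ValueError(f"Error at index {pos}: unterminated quoted value")
```

Under a rule like this, a value such as `"my string literal with, and )"` is read as a single unit, because `,` and `)` have no special meaning between the quotes.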

@ianmcook
Member

There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ

Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?

@noahfrn noahfrn mentioned this pull request Jan 26, 2023
@amol-
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while. If it's still relevant, feel free to reopen it and move it forward 👍

@amol- amol- closed this Mar 30, 2023
@zinking

zinking commented Jul 31, 2023

> There is a discussion on the Substrait mailing list about defining an expression language as part of a text serialization format for Substrait: https://groups.google.com/g/substrait/c/iCiQR-tHI4Q/m/slzrzdcQAgAJ
>
> Substrait seems like a more appropriate and sustainable place to define an expression language, maintain and version it over time, handle forward and backward compatibility considerations across versions, etc. Of course we will still need Arrow libraries to implement parsers for the expression language. Could the work in this PR be adapted to parse expressions in a language along the lines of what is proposed in that thread on the Substrait mailing list?

While the Substrait design might be the correct direction to go, I feel its scope is much broader than this PR's. This PR could bring the basics of filtering into Arrow, so some user requests could be fulfilled.

When the Substrait integration is mature and complete, this can be switched over at that point.

All in all, folks, @ianmcook @amol- @westonpace, is there any chance this gets revived and merged?

@danepitkin
Member

Hi @zinking, I don't think there is any plan to revive this.

FWIW, the Substrait ExtendedExpression support landed in C++ and Python[1]. The Java implementation[2] is in final review stages as well. I believe this is the current preferred approach.

[1] #34834
[2] #35570
