-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement basic version of RLIKE #734
Conversation
def isSupportedPattern(pattern: String): Boolean = { | ||
// this is a placeholder for implementing logic to determine if the pattern | ||
// is known to be compatible with Spark, so that we can enable regexp automatically | ||
// for common cases and fallback to Spark for more complex cases | ||
false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan on implementing this in a future PR
Last time, we had a discussion around dictionary #469 (comment) |
Yes, the test now covers the dictionary case (as far as I can tell). The expression is never invoked with a dictionary though , so it seems to be converted before reaching the expression. |
val data = Seq("James Smith", "Michael Rose", "Rames Rose", "Rames rose") ++ | ||
gen.generateStrings(100, "rames Rose", 12) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need enough number of repeated strings in order to make a dictionary. Right now, it has a low chance due to the randomness...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kazuyukitanimura I updated the test and have now added dictionary support
microbenchmark results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
(cherry picked from commit e33d560)
Which issue does this PR close?
N/A
Rationale for this change
RLIKE is pretty common in ETL jobs.
What changes are included in this PR?
This PR implements the RLIKE expression, but falls back to Spark for all regexp patterns that are not currently guaranteed to be compatible (which is all of them for now) unless a new config (
spark.comet.regexp.allowIncompatible
) is enabled.How are these changes tested?
New tests