Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva #8844

Closed
wants to merge 1 commit into from

Conversation

wjones127
Copy link
Member

Was looking for "rlike" support in Gandiva and found we had nearly implementing it. This PR revives #5860.

@wjones127 wjones127 changed the title ARROW-7205 : [C++][Gandiva] Implemente regexp_like in Gandiva ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva Dec 6, 2020
@github-actions
Copy link

github-actions bot commented Dec 6, 2020

@wjones127 wjones127 marked this pull request as ready for review December 7, 2020 05:06
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 23, 2020
@wjones127 wjones127 force-pushed the ARROW-7205/gandiva-rlike branch from 960406e to 6a0f5cb Compare January 1, 2021 20:12
@wjones127
Copy link
Member Author

@pravindra @Praveen2112 @projjal Got some time for a review? 😁

@projjal
Copy link
Contributor

projjal commented Mar 31, 2021

Hi @wjones127. Looks like creating a base holder class and two derived class for just these two functions seems overkill to me (I think thats why I closed the original pr temporarily). The only difference in logic between the two functions is that in case of "like" we convert the sql pattern to regex pattern before storing the compiled pattern. Would it make sense if we pass a flag to the holder that says if this is sql pattern or regex patttern?

@wjones127
Copy link
Member Author

Projjal, that's a good point. The SQL like function is basically a wrapper around regexp_like. I don't know why I didn't see the optimizations should be the same.

I will consolidate them into a single LikeHolder.

@wjones127 wjones127 force-pushed the ARROW-7205/gandiva-rlike branch from a565f19 to 41133b9 Compare April 2, 2021 00:43
std::shared_ptr<LikeHolder>* holder) {
std::string pcre_pattern;
ARROW_RETURN_NOT_OK(RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern));
Result<std::string> RegexpMatchesHolder::GetPattern(const FunctionNode& node) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI This looks to be the first use of Arrow::Result in the Gandiva codebase, but this seems to be what the main arrow C++ codebase is using recently. Let me know if you object.

@@ -29,7 +29,7 @@ namespace gandiva {
/// \brief Utility class for converting sql patterns to pcre patterns.
class GANDIVA_EXPORT RegexUtil {
public:
// Convert an sql pattern to a pcre pattern
// Convert an sql pattern to a pcre pattern for use with PartialMatch
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, like was using RE2::FullMatch. However, regexp_matches needs to use RE2::PartialMatch, so I've modified this method to make partial match statements. I have added a new set of tests to validate it.

I didn't see any use of this utility method outside of the like implementation, so I thought this change would be okay.

// Escape any char that is special for pcre regex
if (pcre_regex_specials_.find(cur) != pcre_regex_specials_.end()) {
if (pcre_regex_specials_.find(cur) != pcre_regex_specials_.end() && cur != escape_char) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change enables using \ as an escape character in LIKE statements. For example, col like '\_%' can match anything starting with an underscore.

@wjones127
Copy link
Member Author

I think we should merge #9700 before this one, since my refactor of LikeHolder would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.

@nealrichardson
Copy link
Member

I think we should merge #9700 before this one, since my refactor of LikeHolder would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.

@wjones127 #9700 has merged, are you planning to pick this back up?

@wjones127
Copy link
Member Author

I think we should merge #9700 before this one, since my refactor of LikeHolder would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.

@wjones127 #9700 has merged, are you planning to pick this back up?

Yes, I'll work on it this weekend.

@wjones127 wjones127 force-pushed the ARROW-7205/gandiva-rlike branch from 55d96b3 to a21742f Compare July 25, 2021 18:10
@wjones127
Copy link
Member Author

@projjal This is now ready for review.

Since we've added ilike and special escape_char options for like, I made SQLLikeHolder inherit from RegexMatchHolder instead of sharing the same holder class. I did make sure the TryOptimize logic wasn't duplicated.

fixed linting issues

fixed check style issues

Fixed some names

Remove extra whitespace

Add substr short-ciruit for regexp_like

Refactor to share logic between like and rlike

Fix style issues

Fix segfault

Fix SqlLikePatternToPcre to use partial matching

Fix style issues and warning

Fix formatting; ran the docker image and used clang-format
@wjones127 wjones127 force-pushed the ARROW-7205/gandiva-rlike branch from a21742f to 7621eed Compare September 4, 2021 20:37
@wjones127 wjones127 closed this Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Gandiva needs-rebase A PR that needs to be rebased by the author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants