Add optimized filter kernel for regular expression matching #697

Closed
alamb opened this issue Aug 18, 2021 · 2 comments · Fixed by #706
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Contributor

alamb commented Aug 18, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In apache/datafusion#870, @b41sh added support for filtering values that do or do not match a particular regular expression. However, it uses the regexp_match kernel (the only one available at the time of writing), which returns the actual matches (as a ListArray) rather than just a true/false flag (as a BooleanArray) indicating whether each row matched. This is suboptimal because:

  1. It is more work to construct a ListArray than a BooleanArray
  2. There is extra work to then turn the ListArray back into a BooleanArray
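
To make the two costs concrete, here is a minimal sketch that uses plain Rust vectors instead of Arrow arrays. Everything in it is an assumption for illustration: `find_matches`, `match_lists`, `lists_to_bools`, and `match_bools` are hypothetical names, substring search stands in for a real regex engine, and representing a non-matching row as an empty list is a choice of this sketch, not a statement about how regexp_match behaves.

```rust
// Stand-in for a regex engine: plain substring search, so only std is needed.
fn find_matches(value: &str, pattern: &str) -> Vec<String> {
    value.matches(pattern).map(|m| m.to_string()).collect()
}

// Path 1, step 1: materialize every match for every row (the list-shaped result).
fn match_lists(values: &[Option<&str>], pattern: &str) -> Vec<Option<Vec<String>>> {
    values
        .iter()
        .map(|v| v.map(|s| find_matches(s, pattern)))
        .collect()
}

// Path 1, step 2: collapse each per-row list back into a single flag.
fn lists_to_bools(lists: &[Option<Vec<String>>]) -> Vec<Option<bool>> {
    lists
        .iter()
        .map(|l| l.as_ref().map(|m| !m.is_empty()))
        .collect()
}

// Path 2: produce the flag directly, with no per-row allocation.
fn match_bools(values: &[Option<&str>], pattern: &str) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|v| v.map(|s| s.contains(pattern)))
        .collect()
}
```

Both paths give the same flags, but the first one allocates a vector of strings per row only to throw it away in the second step.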

Describe the solution you'd like
Add an arrow compute kernel (perhaps in the comparison module) with a signature like the one below.

A better name is TBD: regexp_matches_utf8 is similar to like_utf8, but perhaps also too similar to regexp_match.

pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
    array: &GenericStringArray<OffsetSize>, 
    regex_array: &GenericStringArray<OffsetSize>, 
    flags_array: Option<&GenericStringArray<OffsetSize>>
) -> Result<BooleanArray>

Where the resulting BooleanArray is

  • true if there was 1 or more matches of the regex array/flags
  • false if there were 0 matches of the regex array/flags
  • NULL if the input or regexp array was null (the same null semantics as regexp_match and like_utf8)
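
The three rules above can be sketched on plain slices of `Option<&str>`, without depending on Arrow or a regex crate. This is only an illustration under stated assumptions: `regexp_matches_sketch` and `is_match` are hypothetical names, substring search (with the flag "i" meaning case-insensitive) stands in for real regex matching, and treating a null flags entry as "no flags" is an assumption that the proposal does not spell out.

```rust
// Hypothetical stand-in for a regex engine so the sketch needs only std:
// substring search, where the flag "i" means case-insensitive.
fn is_match(value: &str, pattern: &str, flags: Option<&str>) -> bool {
    match flags {
        Some(f) if f.contains('i') => value
            .to_lowercase()
            .contains(&pattern.to_lowercase()),
        _ => value.contains(pattern),
    }
}

// Row-wise match with the proposed null semantics:
// null value or null pattern gives null, otherwise true/false.
fn regexp_matches_sketch(
    values: &[Option<&str>],
    patterns: &[Option<&str>],
    flags: Option<&[Option<&str>]>,
) -> Result<Vec<Option<bool>>, String> {
    if values.len() != patterns.len() {
        return Err("values and patterns must have the same length".to_string());
    }
    if let Some(f) = flags {
        if f.len() != values.len() {
            return Err("flags must have the same length as values".to_string());
        }
    }
    let out = values
        .iter()
        .zip(patterns.iter())
        .enumerate()
        .map(|(i, (v, p))| {
            // Assumption: a null flags entry means "no flags" for that row.
            let flag = flags.and_then(|f| f[i]);
            match (*v, *p) {
                (Some(v), Some(p)) => Some(is_match(v, p, flag)),
                _ => None,
            }
        })
        .collect();
    Ok(out)
}
```

A real kernel would write into a BooleanArray builder and its validity bitmap rather than a `Vec<Option<bool>>`, but the per-row decision would follow the same shape.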

Describe alternatives you've considered
None yet

Additional context
See use in apache/datafusion#870

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Aug 18, 2021
@b41sh
Contributor

b41sh commented Aug 18, 2021

@alamb This is a good idea; the previous implementation is really not good enough. I'd like to work on this.

@alamb
Contributor Author

alamb commented Aug 18, 2021


2 participants