Is pre-compile pattern string in regexp_match operation #11146
Comments
Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for #8051. I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked, in the only benchmark we have (one of the clickbench queries), compiling the regexp was not a significant consumer of time.
I think the pattern should be pre-compiled when the binary physical expr is created, instead of each time a record_batch is evaluated.
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 60 to 66 in 57280e4
I think this makes sense to me. Thank you. One thing to look into might be replacing the regexp match operator with a function (and having the function's implementation store the precompiled regexp). One tricky part might be serialization (in datafusion-proto). We can probably handle it via an extension codec or something.
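To make the function-based idea concrete, here is a rough, dependency-free sketch. Everything in it is hypothetical: `ToyRegex`, `RegexpMatchFunc`, `encode`, and `decode` are illustrative names, not DataFusion or datafusion-proto APIs, and the toy matcher (literal substring, or prefix when the pattern starts with `^`) only stands in for a real compiled regexp. The point is the shape: compile once in the constructor, and serialize only the pattern string so that decoding recompiles it.

```rust
/// Toy stand-in for a compiled regexp (assumption: a real implementation
/// would hold a compiled regex object here instead).
struct ToyRegex {
    prefix_only: bool,
    literal: String,
}

/// "Compile" a pattern: a leading '^' means prefix match, otherwise substring.
fn compile(pattern: &str) -> Result<ToyRegex, String> {
    if pattern.is_empty() {
        return Err("empty pattern".to_string());
    }
    match pattern.strip_prefix('^') {
        Some(rest) => Ok(ToyRegex { prefix_only: true, literal: rest.to_string() }),
        None => Ok(ToyRegex { prefix_only: false, literal: pattern.to_string() }),
    }
}

/// Hypothetical function object that owns both the source pattern
/// (needed for serialization) and the precompiled form (needed for speed).
struct RegexpMatchFunc {
    pattern: String,
    compiled: ToyRegex,
}

impl RegexpMatchFunc {
    /// Compilation happens exactly once, at construction time.
    fn try_new(pattern: &str) -> Result<Self, String> {
        let compiled = compile(pattern)?;
        Ok(Self { pattern: pattern.to_string(), compiled })
    }

    /// Evaluate one batch of values using the stored compiled pattern.
    fn invoke(&self, values: &[&str]) -> Vec<bool> {
        values
            .iter()
            .map(|v| {
                if self.compiled.prefix_only {
                    v.starts_with(&self.compiled.literal)
                } else {
                    v.contains(&self.compiled.literal)
                }
            })
            .collect()
    }

    /// Serialize only the pattern string; the compiled form is not portable.
    fn encode(&self) -> Vec<u8> {
        self.pattern.as_bytes().to_vec()
    }

    /// Deserialize by recompiling the pattern on the receiving side.
    fn decode(bytes: &[u8]) -> Result<Self, String> {
        let pattern = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
        Self::try_new(pattern)
    }
}
```

Under these assumptions, `RegexpMatchFunc::decode(&f.encode())` yields a function that behaves like `f`, which is roughly what an extension codec would need to guarantee.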
Is your feature request related to a problem or challenge?
I noticed the code below:
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 537 to 564 in 6a4a280
It looks like the pattern string is compiled every time a record_batch is evaluated, and the compiled result is then used to match the arrow array.
Describe the solution you'd like
When building the binary physical expr, we can pre-compile the pattern string if the op is a regex match.
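A minimal sketch of the difference, with no dependencies. The types here (`CompiledPattern`, `PerBatchExpr`, `PrecompiledExpr`) are hypothetical and are not the actual DataFusion structs; a literal-substring matcher stands in for a compiled regexp, and a thread-local counter records how often "compilation" runs.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many times the stand-in pattern compiler runs on this thread.
    static COMPILE_COUNT: Cell<usize> = Cell::new(0);
}

fn compile_count() -> usize {
    COMPILE_COUNT.with(|c| c.get())
}

fn reset_compile_count() {
    COMPILE_COUNT.with(|c| c.set(0));
}

/// Toy stand-in for a compiled regexp: matches if the value contains the literal.
struct CompiledPattern {
    literal: String,
}

impl CompiledPattern {
    fn compile(pattern: &str) -> Self {
        // Record every compilation so the two strategies can be compared.
        COMPILE_COUNT.with(|c| c.set(c.get() + 1));
        CompiledPattern { literal: pattern.to_string() }
    }

    fn is_match(&self, value: &str) -> bool {
        value.contains(&self.literal)
    }
}

/// Behaviour described in this issue: the pattern is compiled inside evaluate,
/// so the cost is paid once per batch.
struct PerBatchExpr {
    pattern: String,
}

impl PerBatchExpr {
    fn evaluate(&self, batch: &[&str]) -> Vec<bool> {
        let compiled = CompiledPattern::compile(&self.pattern);
        batch.iter().map(|v| compiled.is_match(v)).collect()
    }
}

/// Proposed behaviour: compile when the expression is built, reuse per batch.
struct PrecompiledExpr {
    compiled: CompiledPattern,
}

impl PrecompiledExpr {
    fn new(pattern: &str) -> Self {
        PrecompiledExpr { compiled: CompiledPattern::compile(pattern) }
    }

    fn evaluate(&self, batch: &[&str]) -> Vec<bool> {
        batch.iter().map(|v| self.compiled.is_match(v)).collect()
    }
}
```

Evaluating N batches compiles the pattern N times with `PerBatchExpr` and once with `PrecompiledExpr`. This only applies when the pattern is a constant known at build time; a pattern that comes from a column would still have to be compiled during evaluation.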
Describe alternatives you've considered
No response
Additional context
No response