-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Unify Function Interface (remove BuiltInScalarFunction
)
#8045
Comments
Would like to add that supporting serialization of user-defined functions would be quite nice. The current approach to serialization is to basically just use a string and so if you have a scalar udf with some constant parameter you end up having to encode the constant as a literal expression which is less than ideal. Providing a mechanism (through |
I don't understand this question @thinkharderdev 🤔 The current approach to serialize a And then the physical plan deserialization looks that function up by name https://docs.rs/datafusion-proto/latest/src/datafusion_proto/physical_plan/from_proto.rs.html#318-333 The only difference compared to |
Sorry, what I mean is that it would be useful to be able to serialize constant parameters into the user-defined scalar function themselves rather than pass them in as expressions. So for instance if you had to create a UDF to do something with a regex that you have as a static constant. Currently the way to do that is pass it as a literal expression. But then you have to compile the regex again for every batch you process through the UDF. Ideally you could have something like:
The regex would only need to be compiled once during deserialization (or construction) instead of once for each batch. |
This makes total sense -- I filed #8051 to track the issue |
Here is a PR that shows what DataFusion might look like after removing |
According to the previous prototyping in #7978, we might need to do several cleanups towards this issue. Following 2 is necessary for separating
|
Thank you for the investigation @2010YOUY01 This gap may be one of the gaps we are observing in IOx from a recent update from @mustafasrepo and @ozankabak in #8006 / https://github.com/apache/arrow-datafusion/issues/8043🤔 I will try and make it a priority to get a PR up to encapsuate the |
This is a similar usecase to what @thinkharderdev describes on #8051 -- I wrote up some ideas #8051 (comment) |
@2010YOUY01 -- I totally agree this could be cleaned up. Moving it to specific implementations would be great. Could you make a PR to do so? |
@2010YOUY01 I have updated the task list on this ticket based on your investigation. Please take a look when you have a chance. The only one I don't understand is
Can you explain why validation of input types + output type calculation need to be separated? |
I think it's a simple refactor, I can open a PR to better explain and solve this issue
Sure |
Thanks for driving this forward @2010YOUY01 -- it is very much appreciated. I am planning to merge #8079 in 4 more days (after it has been open for a week to ensure anyone who wants has had a chance to review it) In the mean time, I was thinking what other work could be done while we are waiting. I see two options:
What do you think? Is any of that interesting to you? |
I'm going to work on 2 in the next few days. I think this |
Thanks! one thought I had was to introduce a new Expr:: variant, and port one or two of the expr_fn's to show how it works and the the first PR through. Then as a second PR we can port over the remaining expr_fns (which will be a larger PR but very mechanical) |
Is your feature request related to a problem or challenge?
This is based on the wonderful writeup from @2010YOUY01 in #7977
As previously discussed in #7110 #7752 there are a few challenges with how ScalarFunctions are handled, notably that there are two distinct implementations --
BuiltinScalarFunction
andScalarUDF
Problems with
BuiltinScalarFunction
Enum BuiltinScalarFunction
, and function implementations likereturn_type()
are large methods that match every enum variant.Problems with
ScalarUDF
ScalarUDF
s is a struct, and does not cover all the functionalities of existing built-in functionsScalarUDF
requires constructing a struct in an imperative way providingArc
function pointers (see examples/simple_udf.rs) for each part of the UDF, which is not familiar to Rust users where it is more common to seedyn Trait
objectsDescribe the solution you'd like
I propose moving DataFusion to only use
ScalarUDF
s and removeBuiltInScalarFunction
. This will ensure:We will keep the existing
ScalarUDF
interface as much as possible, while also potentially providing an easier way to define them (ideally via a trait object).Describe alternatives you've considered
#7977 describes introducing a new trait and unifying both ScalarUDF and BuiltInScalarFunction with this trait.
This approach also allows gradually migrating existing built-in functions to the new one, the old UDF interface
create_udf()
can keep unchanged.However, I think it is a bigger change for users, and has the danger of making the overall complexity of DataFusion worse. As demonstrated in #8046 it is also feasible to allow new
ScalarUDF
s to be defined using a trait while retaining backwards compatibility for existingScalarUDF
implementationsAdditional context
Proposed implementation steps:
pub
): RFC: Make fields of ScalarUDF non pub #8039ScalarUDF
API changes for real: Make fields ofScalarUDF
,AggregateUDF
andWindowUDF
nonpub
#8079Expr::AggregateFunction
andExpr::AggregateUDF
#8346expr::window_function::WindowFunction
toWindowFunctionDefintion
for consistency #8347Expr
creation forScalarUDF
: Resolve function calls by name during planning #8157ScalarUDFImpl
(rather than the function pointers) #8712datafusion-function
crate with an initial set of functions as a model (see RFC: Demonstrate what a function package might look like -- encoding expressions #8046)datafusion_functions
crate, file tickets to track them ([Epic] Port BuiltInFunctons todatafusion-functions-*
crates #9285)datafusion-functions-*
crates #9285FunctionRegistry::register_udaf
andFunctionRegistry::register_udwf
#9074AggregateUDF
[Epic] UnifyAggregateFunction
Interface (remove built in list ofAggregateFunction
s), improve the system #8708 andWindowUDF
[Epic] UnifyWindowFunction
Interface (remove built in list ofBuiltInWindowFunction
s) #8709The text was updated successfully, but these errors were encountered: