Perf: Allow User defined functions to potentially reuse their argument arrays (to avoid new allocations) #13516
Comments
I think @joseph-isaacs has done this in #13491
For closing this issue could we have an example showing how to do this reuse and proving that it really takes place?
This is a good idea -- I agree an example should be written
Are you sure? I took a look at ScalarFunctionArgs in main and it seems to use
Good call @Omega359 -- I think it was removed eventually. I'll try and make the change with an example over the next few days/weeks
Here is a PR that implements the change and adds an example:
Is your feature request related to a problem or challenge?
Arrow Arrays are designed to be immutable and use shared references extensively, but it is possible to reuse the underlying buffer in some cases when there are no other references (see the arrow `unary_mut` kernel for example).
At the time of writing, DataFusion scalar functions (`ScalarFunctionImpl`) must always allocate a new array when generating output. They cannot reuse the existing underlying memory, even if the source array will never be used again.
This is because the invoke signature gets the arguments by reference (a slice of `ColumnarValue`) rather than by ownership.
For example, an expression like `(a + b) + c` will be evaluated like
`a + b` --> `temp_array`
`temp_array + c` --> `result_array`
Resulting in two new allocations.
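To make the allocation cost concrete, here is a minimal std-only sketch of the by-reference evaluation described above. `ToyArray` and `add_by_ref` are hypothetical names used only for illustration; they model an immutable, reference-counted buffer with `Arc<Vec<i64>>` and are not the arrow-rs or DataFusion API.

```rust
use std::sync::Arc;

/// Toy stand-in for an immutable array backed by a shared buffer.
/// (Hypothetical type for illustration; not the arrow-rs API.)
struct ToyArray {
    values: Arc<Vec<i64>>,
}

impl ToyArray {
    fn new(values: Vec<i64>) -> Self {
        Self { values: Arc::new(values) }
    }
}

/// By-reference addition: the inputs are only borrowed, so the output
/// has to be written into a freshly allocated buffer.
fn add_by_ref(lhs: &ToyArray, rhs: &ToyArray) -> ToyArray {
    let out: Vec<i64> = lhs
        .values
        .iter()
        .zip(rhs.values.iter())
        .map(|(l, r)| l + r)
        .collect(); // new allocation for every call
    ToyArray::new(out)
}

fn main() {
    let a = ToyArray::new(vec![1, 2, 3]);
    let b = ToyArray::new(vec![10, 20, 30]);
    let c = ToyArray::new(vec![100, 200, 300]);

    let temp_array = add_by_ref(&a, &b); // allocation #1
    let result_array = add_by_ref(&temp_array, &c); // allocation #2

    // Each output lives in its own buffer; nothing from `a` is reused.
    assert_ne!(temp_array.values.as_ptr(), a.values.as_ptr());
    assert_ne!(result_array.values.as_ptr(), temp_array.values.as_ptr());
    println!("{:?}", result_array.values);
}
```

Because `add_by_ref` only ever sees `&ToyArray`, it cannot tell whether the caller is finished with the input, which is the same limitation the by-reference invoke signature imposes.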
Describe the solution you'd like
It would be really nice if it were possible to evaluate `(a + b) + c` like this (with no new allocations)
`a + b` --> `a` (write output to `a`, reusing allocation)
`a + c` --> `a` (now add c, also reusing allocation)
And the result would be a new array that re-used the original allocation of the `a` array.
Describe alternatives you've considered
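As a rough illustration of the in-place evaluation requested in the solution section, the sketch below passes the left-hand argument by ownership and uses `Arc::make_mut` from the Rust standard library, which mutates in place when the buffer has a single owner and copies otherwise. `ToyArray` and `add_owned` are hypothetical names, and this models the idea with std types only; it is not how arrow-rs or DataFusion implement it.

```rust
use std::sync::Arc;

/// Toy stand-in for an immutable array backed by a shared buffer.
/// (Hypothetical type for illustration; not the arrow-rs API.)
#[derive(Clone)]
struct ToyArray {
    values: Arc<Vec<i64>>,
}

impl ToyArray {
    fn new(values: Vec<i64>) -> Self {
        Self { values: Arc::new(values) }
    }
}

/// Addition that takes ownership of `lhs`.
/// `Arc::make_mut` hands back the existing buffer when `lhs` is the only
/// owner, and clones it first when another reference is still alive.
fn add_owned(mut lhs: ToyArray, rhs: &ToyArray) -> ToyArray {
    let buf = Arc::make_mut(&mut lhs.values);
    for (l, r) in buf.iter_mut().zip(rhs.values.iter()) {
        *l += *r;
    }
    lhs
}

fn main() {
    let a = ToyArray::new(vec![1, 2, 3]);
    let b = ToyArray::new(vec![10, 20, 30]);
    let c = ToyArray::new(vec![100, 200, 300]);

    // Unique owner: both additions write into the original buffer of `a`.
    let a_ptr = a.values.as_ptr();
    let ab = add_owned(a, &b);
    let abc = add_owned(ab, &c);
    assert_eq!(abc.values.as_ptr(), a_ptr);
    assert_eq!(*abc.values, vec![111, 222, 333]);

    // Shared buffer: the function falls back to copying, so the other
    // reference still sees the original values.
    let shared = ToyArray::new(vec![1, 2, 3]);
    let other_ref = shared.clone();
    let out = add_owned(shared, &b);
    assert_ne!(out.values.as_ptr(), other_ref.values.as_ptr());
    assert_eq!(*other_ref.values, vec![1, 2, 3]);
}
```

The pointer comparison is one simple way to show that the reuse really takes place, and the shared case shows why ownership alone is not enough: the buffer must also have no other references.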
Now that this is merged:
`ScalarUDFImpl::invoke_with_args` to support passing the return type created for the udf instance #13290 (thanks @joseph-isaacs)
I think we can make it possible in the future to reuse allocations by changing what is passed into `ScalarFunctionArgs`.
Since we haven't yet released a version with `ScalarFunctionArgs`, we can change its signature without breaking APIs until DataFusion 44 is released.
Additional context
I have a draft of the basic idea here: