Skip to content

Implement method to apply scalar or aggregate function to Array elements #15882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
timsaucer opened this issue Apr 28, 2025 · 3 comments
Open
Labels
enhancement New feature or request

Comments

@timsaucer
Copy link
Contributor

Is your feature request related to a problem or challenge?

Suppose I have an DataFrame in which one column contains arrays. I wish to be able to apply any scalar expr to each value of that array and return an array out. For example I would like to be able to apply an abs() function and convert data such as this:

DataFrame()
+--------------+-------------+
| a            | abs(a)      |
+--------------+-------------+
| [-10, 5, 13] | [10, 5, 13] |
| [2]          | [2]         |
| [-3, 1]      | [3, 1]      |
+--------------+-------------+

Additionally it would be amazing to be able to apply any aggregate function to an array element.

DataFrame()
+--------------+--------+
| a            | sum(a) |
+--------------+--------+
| [-10, 5, 13] | 8      |
| [2]          | 2      |
| [-3, 1]      | 2      |
+--------------+--------+

Describe the solution you'd like

This is similar to the spark transform operation. It is very powerful for highly structured data. I don't know the best form that that functions would take, but it would be even more powerful if we could do element-by-element operations across more than one column in the dataframe. There are many use cases where you will have columns of array elements of the same length.

Describe alternatives you've considered

The current status quo is to either write a UDF to handle these on a case by case basis or to do an unnest and group by. The unnest and group by can be an expensive operation.

Additional context

No response

@timsaucer timsaucer added the enhancement New feature or request label Apr 28, 2025
@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

I bet something like this already exists in datafusion-functions-array crate -- figuring out how to make it general would be very sweet

@KR-bluejay
Copy link

@alamb

Following your comment about making array functions more general, I suggest we create a common directory for array operations to reduce code duplication:

datafusion/common/src/array/

We could implement each operation in a separate file:

  • array_sum.rs
  • array_abs.rs
  • array_min.rs
  • array_max.rs

This would provide a consistent interface for both element-wise operations and aggregations on arrays, making the codebase more maintainable and easier to extend with new array functions in the future.

I'd like to work on implementing this approach. What do you think about organizing array operations this way? If this approach seems reasonable, I'm happy to start working on it.

@alamb
Copy link
Contributor

alamb commented May 9, 2025

I'd like to work on implementing this approach. What do you think about organizing array operations this way? If this approach seems reasonable, I'm happy to start working on it.

It seems like a reasonable idea to me. I think we would have to see how the code looks -- I am not familiar enough with its current structure to know how big a change this is

However, I think @timsaucer is asking for something different: not a particular array function like array_sum but instead something like array_appy(array, func)

Which woudl take an array and function and apply func to each distinct sub array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants