You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem or challenge?
Suppose I have an DataFrame in which one column contains arrays. I wish to be able to apply any scalar expr to each value of that array and return an array out. For example I would like to be able to apply an abs() function and convert data such as this:
This is similar to the spark transform operation. It is very powerful for highly structured data. I don't know the best form that that functions would take, but it would be even more powerful if we could do element-by-element operations across more than one column in the dataframe. There are many use cases where you will have columns of array elements of the same length.
Describe alternatives you've considered
The current status quo is to either write a UDF to handle these on a case by case basis or to do an unnest and group by. The unnest and group by can be an expensive operation.
Additional context
No response
The text was updated successfully, but these errors were encountered:
Following your comment about making array functions more general, I suggest we create a common directory for array operations to reduce code duplication:
datafusion/common/src/array/
We could implement each operation in a separate file:
array_sum.rs
array_abs.rs
array_min.rs
array_max.rs
This would provide a consistent interface for both element-wise operations and aggregations on arrays, making the codebase more maintainable and easier to extend with new array functions in the future.
I'd like to work on implementing this approach. What do you think about organizing array operations this way? If this approach seems reasonable, I'm happy to start working on it.
I'd like to work on implementing this approach. What do you think about organizing array operations this way? If this approach seems reasonable, I'm happy to start working on it.
It seems like a reasonable idea to me. I think we would have to see how the code looks -- I am not familiar enough with its current structure to know how big a change this is
However, I think @timsaucer is asking for something different: not a particular array function like array_sum but instead something like array_appy(array, func)
Which woudl take an array and function and apply func to each distinct sub array
Is your feature request related to a problem or challenge?
Suppose I have an DataFrame in which one column contains arrays. I wish to be able to apply any scalar expr to each value of that array and return an array out. For example I would like to be able to apply an
abs()
function and convert data such as this:Additionally it would be amazing to be able to apply any aggregate function to an array element.
Describe the solution you'd like
This is similar to the spark
transform
operation. It is very powerful for highly structured data. I don't know the best form that that functions would take, but it would be even more powerful if we could do element-by-element operations across more than one column in the dataframe. There are many use cases where you will have columns of array elements of the same length.Describe alternatives you've considered
The current status quo is to either write a UDF to handle these on a case by case basis or to do an unnest and group by. The unnest and group by can be an expensive operation.
Additional context
No response
The text was updated successfully, but these errors were encountered: