-
Notifications
You must be signed in to change notification settings - Fork 0
19240: fix: #10020 make concat function support array concatenation #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -15,8 +15,9 @@ | |||||||||||||||
| // specific language governing permissions and limitations | ||||||||||||||||
| // under the License. | ||||||||||||||||
|
|
||||||||||||||||
| use arrow::array::{Array, as_largestring_array}; | ||||||||||||||||
| use arrow::datatypes::DataType; | ||||||||||||||||
| use arrow::array::{Array, as_largestring_array, ArrayRef}; | ||||||||||||||||
| use arrow::datatypes::{DataType, Field, TimeUnit}; | ||||||||||||||||
| use arrow_array::types::*; | ||||||||||||||||
| use datafusion_expr::sort_properties::ExprProperties; | ||||||||||||||||
| use std::any::Any; | ||||||||||||||||
| use std::sync::Arc; | ||||||||||||||||
|
|
@@ -26,10 +27,10 @@ use crate::strings::{ | |||||||||||||||
| ColumnarValueRef, LargeStringArrayBuilder, StringArrayBuilder, StringViewArrayBuilder, | ||||||||||||||||
| }; | ||||||||||||||||
| use datafusion_common::cast::{as_string_array, as_string_view_array}; | ||||||||||||||||
| use datafusion_common::{Result, ScalarValue, internal_err, plan_err}; | ||||||||||||||||
| use datafusion_expr::expr::ScalarFunction; | ||||||||||||||||
| use datafusion_common::{Result, ScalarValue, internal_err, plan_err, exec_err}; | ||||||||||||||||
| use datafusion_expr::expr::{ScalarFunction, ScalarFunctionExpr}; | ||||||||||||||||
| use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyInfo}; | ||||||||||||||||
| use datafusion_expr::{ColumnarValue, Documentation, Expr, Volatility, lit}; | ||||||||||||||||
| use datafusion_expr::{ColumnarValue, Documentation, Expr, Volatility, lit, BuiltinScalarFunction}; | ||||||||||||||||
| use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl, Signature}; | ||||||||||||||||
| use datafusion_macros::user_doc; | ||||||||||||||||
|
|
||||||||||||||||
|
|
@@ -67,13 +68,42 @@ impl ConcatFunc { | |||||||||||||||
| pub fn new() -> Self { | ||||||||||||||||
| use DataType::*; | ||||||||||||||||
| Self { | ||||||||||||||||
| signature: Signature::variadic( | ||||||||||||||||
| vec![Utf8View, Utf8, LargeUtf8], | ||||||||||||||||
| Volatility::Immutable, | ||||||||||||||||
| signature: Signature::variadic_any(Volatility::Immutable) | ||||||||||||||||
| ), | ||||||||||||||||
| } | ||||||||||||||||
| } | ||||||||||||||||
| } | ||||||||||||||||
|
|
||||||||||||||||
| // Check if any argument is an array type | ||||||||||||||||
| fn has_array_args(args: &[ColumnarValue]) -> bool { | ||||||||||||||||
| use arrow::datatypes::DataType::*; | ||||||||||||||||
| args.iter().any(|arg| match arg { | ||||||||||||||||
| ColumnarValue::Array(arr) => matches!(arr.data_type(), List(_)), | ||||||||||||||||
| ColumnarValue::Scalar(scalar) => matches!(scalar, ScalarValue::List(_, _)), | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This pattern won’t match: 🤖 Was this useful? React with 👍 or 👎
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the ScalarValue::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete. |
||||||||||||||||
| }) | ||||||||||||||||
| } | ||||||||||||||||
|
Comment on lines
+76
to
+83
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
# Find the ScalarValue enum definition
rg -n "enum ScalarValue" --type rust -A 50 | head -100Repository: martin-augment/datafusion Length of output: 3784 🏁 Script executed: #!/bin/bash
# Search for ScalarValue::List pattern usage in the codebase
rg "ScalarValue::List\(" --type rust -B2 -A2 | head -50Repository: martin-augment/datafusion Length of output: 3735 🏁 Script executed: #!/bin/bash
# Look specifically in arrow crate or where ScalarValue is defined
fd -e rs | xargs rg -l "enum ScalarValue" | head -5Repository: martin-augment/datafusion Length of output: 104 🏁 Script executed: #!/bin/bash
# Get the full ScalarValue enum definition including List variant
sed -n '324,450p' datafusion/common/src/scalar/mod.rsRepository: martin-augment/datafusion Length of output: 5239 Fix incorrect pattern matching for The pattern 🤖 Prompt for AI Agents |
||||||||||||||||
|
|
||||||||||||||||
| // Convert arguments to array_concat function | ||||||||||||||||
| fn to_array_concat(args: Vec<Expr>) -> Result<Expr> { | ||||||||||||||||
| let array_concat = BuiltinScalarFunction::ArrayConcat; | ||||||||||||||||
| let args = args.into_iter() | ||||||||||||||||
| .map(|arg| { | ||||||||||||||||
| // If the argument is not already an array, wrap it in an array | ||||||||||||||||
| if !matches!(arg.get_type(&DataType::Null).unwrap_or_default(), DataType::List(_)) { | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using 🤖 Was this useful? React with 👍 or 👎
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed code does not even compile. The expected parameter is |
||||||||||||||||
| Ok(Expr::ScalarFunction(ScalarFunction::new( | ||||||||||||||||
| BuiltinScalarFunction::MakeArray, | ||||||||||||||||
| vec![arg], | ||||||||||||||||
| ))) | ||||||||||||||||
| } else { | ||||||||||||||||
| Ok(arg) | ||||||||||||||||
| } | ||||||||||||||||
| }) | ||||||||||||||||
| .collect::<Result<Vec<_>>>()?; | ||||||||||||||||
|
|
||||||||||||||||
| Ok(Expr::ScalarFunction(ScalarFunction::new( | ||||||||||||||||
| array_concat, | ||||||||||||||||
| args, | ||||||||||||||||
| ))) | ||||||||||||||||
| } | ||||||||||||||||
|
Comment on lines
+85
to
+106
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: # First, let's examine the actual file and understand the context
cd datafusion/functions/src/string
wc -l concat.rsRepository: martin-augment/datafusion Length of output: 82 🏁 Script executed: # Read the specific lines mentioned in the review
sed -n '85,106p' concat.rsRepository: martin-augment/datafusion Length of output: 121 🏁 Script executed: # Check if BuiltinScalarFunction exists and how it's used
rg -n "BuiltinScalarFunction" --type rust -C 2 datafusion/ | head -150Repository: martin-augment/datafusion Length of output: 1597 🏁 Script executed: # Check Expr::get_type signature
rg -n "fn get_type" --type rust -A 3 datafusion/ | head -100Repository: martin-augment/datafusion Length of output: 2945 🏁 Script executed: # Check ScalarFunction::new signature and usage
rg -n "ScalarFunction::new" --type rust -C 2 datafusion/ | head -100Repository: martin-augment/datafusion Length of output: 8058 🏁 Script executed: # Check how array_concat is currently implemented
rg -n "array_concat" --type rust -C 3 datafusion/ | head -80Repository: martin-augment/datafusion Length of output: 6547 🏁 Script executed: # Check for MakeArray implementation
rg -n "MakeArray" --type rust -C 3 datafusion/ | head -100Repository: martin-augment/datafusion Length of output: 2933 🏁 Script executed: # Check the ScalarFunction::new signature
rg -n "fn new\(" --type rust datafusion/expr/src/expr/ -A 2 | grep -A 2 "ScalarFunction"Repository: martin-augment/datafusion Length of output: 134 🏁 Script executed: # Verify if ScalarFunction::new still exists with BuiltinScalarFunction
rg -n "impl ScalarFunction" --type rust datafusion/expr/src/expr/ -A 10Repository: martin-augment/datafusion Length of output: 134 🏁 Script executed: # Check what the correct approach should be - look at array_concat and make_array helper functions
rg -n "pub fn array_concat" --type rust datafusion/functions-nested/src/ -A 5Repository: martin-augment/datafusion Length of output: 51 🏁 Script executed: # Check make_array helper function
rg -n "pub fn make_array" --type rust datafusion/functions-nested/src/ -A 5Repository: martin-augment/datafusion Length of output: 51 🏁 Script executed: # Find ScalarFunction definition
rg -n "struct ScalarFunction" --type rust -A 5 datafusion/Repository: martin-augment/datafusion Length of output: 1965 🏁 Script executed: # Find the public helper functions in functions-nested
find datafusion/functions-nested/src -name "*.rs" -type f | head -20Repository: martin-augment/datafusion Length of output: 939 🏁 Script executed: # Check expr_fn module for public helpers
rg -n "pub fn array_concat\|pub fn make_array" --type rust datafusion/functions-nested/Repository: martin-augment/datafusion Length of output: 51 🏁 Script executed: # Check what's in expr_fn.rs
cat datafusion/functions-nested/src/expr_fn.rs | head -100Repository: martin-augment/datafusion Length of output: 143 🏁 Script executed: # Find ScalarFunction impl and its methods
rg -n "impl ScalarFunction" --type rust -A 30 datafusion/expr/src/expr.rsRepository: martin-augment/datafusion Length of output: 1287 🏁 Script executed: # Check functions-nested lib.rs for public exports
cat datafusion/functions-nested/src/lib.rs | head -150Repository: martin-augment/datafusion Length of output: 5356 🏁 Script executed: # Check the macros to understand make_udf_expr_and_func
rg -n "make_udf_expr_and_func" --type rust -A 20 datafusion/functions-nested/src/macros.rs | head -80Repository: martin-augment/datafusion Length of output: 1936 🏁 Script executed: # Double-check: search for any usage of ScalarFunction::new with non-new_udf pattern
rg -n "ScalarFunction::new\(" --type rust datafusion/expr/src/expr.rsRepository: martin-augment/datafusion Length of output: 51 🏁 Script executed: # Verify the old BuiltinScalarFunction is still in codebase but unused in new patterns
rg -n "enum BuiltinScalarFunction" --type rust datafusion/ -A 5Repository: martin-augment/datafusion Length of output: 51 This code will not compile—multiple critical API mismatches.
Fix: Use the public helper functions from
🤖 Prompt for AI Agents |
||||||||||||||||
|
|
||||||||||||||||
| impl ScalarUDFImpl for ConcatFunc { | ||||||||||||||||
|
Comment on lines
+106
to
108
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Missing closing brace for The Ok(Expr::ScalarFunction(ScalarFunction::new(
array_concat,
args,
)))
}
+}
impl ScalarUDFImpl for ConcatFunc {📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents |
||||||||||||||||
| fn as_any(&self) -> &dyn Any { | ||||||||||||||||
|
|
@@ -90,22 +120,40 @@ impl ScalarUDFImpl for ConcatFunc { | |||||||||||||||
|
|
||||||||||||||||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||||||||||||||||
| use DataType::*; | ||||||||||||||||
|
|
||||||||||||||||
| // If any argument is an array, return the array type | ||||||||||||||||
| if arg_types.iter().any(|t| matches!(t, List(_))) { | ||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Array detection in 🤖 Was this useful? React with 👍 or 👎
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete. |
||||||||||||||||
| // Find the first array type to use as the base | ||||||||||||||||
| if let Some(DataType::List(field)) = arg_types.iter().find(|t| matches!(t, List(_))) { | ||||||||||||||||
| return Ok(DataType::List(field.clone())); | ||||||||||||||||
| } | ||||||||||||||||
| } | ||||||||||||||||
|
|
||||||||||||||||
| // Otherwise, use the existing string type logic | ||||||||||||||||
| let mut dt = &Utf8; | ||||||||||||||||
| arg_types.iter().for_each(|data_type| { | ||||||||||||||||
| for data_type in arg_types { | ||||||||||||||||
| if data_type == &Utf8View { | ||||||||||||||||
| dt = data_type; | ||||||||||||||||
| } | ||||||||||||||||
| if data_type == &LargeUtf8 && dt != &Utf8View { | ||||||||||||||||
| dt = data_type; | ||||||||||||||||
| } | ||||||||||||||||
| }); | ||||||||||||||||
|
|
||||||||||||||||
| Ok(dt.to_owned()) | ||||||||||||||||
| } | ||||||||||||||||
| Ok(dt.clone()) | ||||||||||||||||
| } | ||||||||||||||||
|
Comment on lines
121
to
143
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: # First, let's examine the concat.rs file more closely to understand the context
cd datafusion/functions/src/string && wc -l concat.rsRepository: martin-augment/datafusion Length of output: 82 🏁 Script executed: # Check the full concat function implementation to understand the context
cd datafusion/functions/src/string && head -160 concat.rs | tail -60Repository: martin-augment/datafusion Length of output: 2014 🏁 Script executed: # Search for tests related to array concatenation
find . -name "*.rs" -type f -exec grep -l "concat.*List\|array.*concat" {} \; | head -20Repository: martin-augment/datafusion Length of output: 726 🏁 Script executed: # Look for type coercion logic in DataFusion
rg "coerce.*type|type.*coerce" --type rs -A 3 -B 3 | head -100Repository: martin-augment/datafusion Length of output: 95 🌐 Web query:
💡 Result: Short answer: array_concat (and the array/list concat UDFs) require the input arrays to have element types that can be coerced to a single common element type; DataFusion will pick a common promoted/compatible Arrow DataType and (implicitly) cast elements as needed using DataFusion/Arrow cast rules. If types are not coercible the planner will error. Note also the SQL concat operator (string concat) is different — when given list/array arguments it may stringify them instead of performing array concatenation; use array_concat / list_concat for real array merge. [1][2][3][4] Sources:
🏁 Script executed: # Check the proper array_concat implementation in functions-nested
cat datafusion/functions-nested/src/concat.rs | head -200Repository: martin-augment/datafusion Length of output: 5941 🏁 Script executed: # Look for type coercion logic in the planner
rg "coerce_types_for_function|type_coerce" --type rs -A 5 | head -150Repository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Search for tests related to array concatenation
rg "array_concat|concat.*List" --type rs tests/ | head -30Repository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Check if there are integration tests for concat with arrays
find . -name "*.rs" -path "*/tests/*" -exec grep -l "array_concat\|concat.*\[" {} \; | head -10Repository: martin-augment/datafusion Length of output: 363 🏁 Script executed: # Get the array_concat implementation in functions-nested
sed -n '200,400p' datafusion/functions-nested/src/concat.rsRepository: martin-augment/datafusion Length of output: 5908 🏁 Script executed: # Search for ArrayConcat struct and return_type implementation
rg "struct ArrayConcat|impl.*ArrayConcat" --type rs -A 30 datafusion/functions-nested/src/concat.rs | head -80Repository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Get the return_type method from array_concat
rg "fn return_type" --type rs -B 5 -A 20 datafusion/functions-nested/src/concat.rsRepository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Look at how type coercion is done for arrays
rg "ListCoercion|coerce_types_for_function" --type rs -B 3 -A 10 | head -100Repository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Get the full string concat implementation
cat datafusion/functions/src/string/concat.rs | head -250Repository: martin-augment/datafusion Length of output: 9541 🏁 Script executed: # Look at the to_array_concat method and how it's used
rg "to_array_concat|has_array_args" --type rs -B 5 -A 15 datafusion/functions/src/string/concat.rsRepository: martin-augment/datafusion Length of output: 95 🏁 Script executed: # Check if there's type coercion happening in the planner for this
rg "concat.*coerce|string.*concat.*type" --type rs datafusion/ | head -20Repository: martin-augment/datafusion Length of output: 95 Type signature mismatch: The 🤖 Prompt for AI Agents |
||||||||||||||||
|
|
||||||||||||||||
| /// Concatenates the text representations of all the arguments. NULL arguments are ignored. | ||||||||||||||||
| /// concat('abcde', 2, NULL, 22) = 'abcde222' | ||||||||||||||||
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { | ||||||||||||||||
| // Check if any argument is an array | ||||||||||||||||
| if Self::has_array_args(&args.args) { | ||||||||||||||||
| // Convert to array_concat expression and evaluate it | ||||||||||||||||
| let exprs: Vec<Expr> = args.args.into_iter().map(|a| a.into_expr()).collect(); | ||||||||||||||||
| let expr = Self::to_array_concat(exprs)?; | ||||||||||||||||
| return expr.eval(args.context); | ||||||||||||||||
| } | ||||||||||||||||
|
|
||||||||||||||||
| // Original string concatenation logic | ||||||||||||||||
| let ScalarFunctionArgs { args, .. } = args; | ||||||||||||||||
|
Comment on lines
147
to
157
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fundamental design issue: Cannot convert This approach has critical flaws:
Consider instead:
#!/bin/bash
# Find how ArrayConcat is implemented for direct invocation
rg -n "impl.*ScalarUDFImpl.*for.*ArrayConcat" --type rust -A 20
# Find the array_concat kernel or function
rg -n "fn array_concat" --type rust -C5🤖 Prompt for AI Agents |
||||||||||||||||
|
|
||||||||||||||||
| let mut return_datatype = DataType::Utf8; | ||||||||||||||||
|
|
||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only
DataType::List(_)is checked; consider includingLargeList(_)andFixedSizeList(_)so all list-like arrays are recognized.🤖 Was this useful? React with 👍 or 👎
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.