Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 61 additions & 13 deletions datafusion/functions/src/string/concat.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,9 @@
// specific language governing permissions and limitations
// under the License.

use arrow::array::{Array, as_largestring_array};
use arrow::datatypes::DataType;
use arrow::array::{Array, as_largestring_array, ArrayRef};
use arrow::datatypes::{DataType, Field, TimeUnit};
use arrow_array::types::*;
use datafusion_expr::sort_properties::ExprProperties;
use std::any::Any;
use std::sync::Arc;
Expand All @@ -26,10 +27,10 @@ use crate::strings::{
ColumnarValueRef, LargeStringArrayBuilder, StringArrayBuilder, StringViewArrayBuilder,
};
use datafusion_common::cast::{as_string_array, as_string_view_array};
use datafusion_common::{Result, ScalarValue, internal_err, plan_err};
use datafusion_expr::expr::ScalarFunction;
use datafusion_common::{Result, ScalarValue, internal_err, plan_err, exec_err};
use datafusion_expr::expr::{ScalarFunction, ScalarFunctionExpr};
use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyInfo};
use datafusion_expr::{ColumnarValue, Documentation, Expr, Volatility, lit};
use datafusion_expr::{ColumnarValue, Documentation, Expr, Volatility, lit, BuiltinScalarFunction};
use datafusion_expr::{ScalarFunctionArgs, ScalarUDFImpl, Signature};
use datafusion_macros::user_doc;

Expand Down Expand Up @@ -67,13 +68,42 @@ impl ConcatFunc {
pub fn new() -> Self {
use DataType::*;
Self {
signature: Signature::variadic(
vec![Utf8View, Utf8, LargeUtf8],
Volatility::Immutable,
signature: Signature::variadic_any(Volatility::Immutable)
),
}
}
}

// Check if any argument is an array type
fn has_array_args(args: &[ColumnarValue]) -> bool {
use arrow::datatypes::DataType::*;
args.iter().any(|arg| match arg {
ColumnarValue::Array(arr) => matches!(arr.data_type(), List(_)),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only DataType::List(_) is checked; consider including LargeList(_) and FixedSizeList(_) so all list-like arrays are recognized.

🤖 Was this useful? React with 👍 or 👎

Copy link
Owner Author

@martin-augment martin-augment Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.

ColumnarValue::Scalar(scalar) => matches!(scalar, ScalarValue::List(_, _)),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern won’t match: ScalarValue::List takes a single Arc<ListArray> (and there are separate LargeList/FixedSizeList variants). As written, scalar list arguments won’t be detected.

🤖 Was this useful? React with 👍 or 👎

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the ScalarValue::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.

})
}
Comment on lines +76 to +83
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the ScalarValue enum definition
rg -n "enum ScalarValue" --type rust -A 50 | head -100

Repository: martin-augment/datafusion

Length of output: 3784


🏁 Script executed:

#!/bin/bash
# Search for ScalarValue::List pattern usage in the codebase
rg "ScalarValue::List\(" --type rust -B2 -A2 | head -50

Repository: martin-augment/datafusion

Length of output: 3735


🏁 Script executed:

#!/bin/bash
# Look specifically in arrow crate or where ScalarValue is defined
fd -e rs | xargs rg -l "enum ScalarValue" | head -5

Repository: martin-augment/datafusion

Length of output: 104


🏁 Script executed:

#!/bin/bash
# Get the full ScalarValue enum definition including List variant
sed -n '324,450p' datafusion/common/src/scalar/mod.rs

Repository: martin-augment/datafusion

Length of output: 5239


Fix incorrect pattern matching for ScalarValue::List.

The pattern ScalarValue::List(_, _) on line 81 is incorrect. The List variant takes a single Arc<ListArray> parameter, not two. Change it to ScalarValue::List(_).

🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 76 to 83, the pattern
match for ScalarValue is incorrect: it uses ScalarValue::List(_, _) but the List
variant takes a single argument; change the match arm to ScalarValue::List(_) so
it matches the single Arc<ListArray> parameter (leave the Array case as-is).


// Convert arguments to array_concat function
fn to_array_concat(args: Vec<Expr>) -> Result<Expr> {
let array_concat = BuiltinScalarFunction::ArrayConcat;
let args = args.into_iter()
.map(|arg| {
// If the argument is not already an array, wrap it in an array
if !matches!(arg.get_type(&DataType::Null).unwrap_or_default(), DataType::List(_)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using expr.get_type without a schema (&DataType::Null) won’t determine if arg is a list and risks wrapping list args in MakeArray, producing nested lists. You likely need schema‑aware type resolution or a different check here.

🤖 Was this useful? React with 👍 or 👎

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed code does not even compile. The expected parameter is schema: &dyn ExprSchema, not &DataType.

Ok(Expr::ScalarFunction(ScalarFunction::new(
BuiltinScalarFunction::MakeArray,
vec![arg],
)))
} else {
Ok(arg)
}
})
.collect::<Result<Vec<_>>>()?;

Ok(Expr::ScalarFunction(ScalarFunction::new(
array_concat,
args,
)))
}
Comment on lines +85 to +106
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the actual file and understand the context
cd datafusion/functions/src/string
wc -l concat.rs

Repository: martin-augment/datafusion

Length of output: 82


🏁 Script executed:

# Read the specific lines mentioned in the review
sed -n '85,106p' concat.rs

Repository: martin-augment/datafusion

Length of output: 121


🏁 Script executed:

# Check if BuiltinScalarFunction exists and how it's used
rg -n "BuiltinScalarFunction" --type rust -C 2 datafusion/ | head -150

Repository: martin-augment/datafusion

Length of output: 1597


🏁 Script executed:

# Check Expr::get_type signature
rg -n "fn get_type" --type rust -A 3 datafusion/ | head -100

Repository: martin-augment/datafusion

Length of output: 2945


🏁 Script executed:

# Check ScalarFunction::new signature and usage
rg -n "ScalarFunction::new" --type rust -C 2 datafusion/ | head -100

Repository: martin-augment/datafusion

Length of output: 8058


🏁 Script executed:

# Check how array_concat is currently implemented
rg -n "array_concat" --type rust -C 3 datafusion/ | head -80

Repository: martin-augment/datafusion

Length of output: 6547


🏁 Script executed:

# Check for MakeArray implementation
rg -n "MakeArray" --type rust -C 3 datafusion/ | head -100

Repository: martin-augment/datafusion

Length of output: 2933


🏁 Script executed:

# Check the ScalarFunction::new signature
rg -n "fn new\(" --type rust datafusion/expr/src/expr/ -A 2 | grep -A 2 "ScalarFunction"

Repository: martin-augment/datafusion

Length of output: 134


🏁 Script executed:

# Verify if ScalarFunction::new still exists with BuiltinScalarFunction
rg -n "impl ScalarFunction" --type rust datafusion/expr/src/expr/ -A 10

Repository: martin-augment/datafusion

Length of output: 134


🏁 Script executed:

# Check what the correct approach should be - look at array_concat and make_array helper functions
rg -n "pub fn array_concat" --type rust datafusion/functions-nested/src/ -A 5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check make_array helper function
rg -n "pub fn make_array" --type rust datafusion/functions-nested/src/ -A 5

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Find ScalarFunction definition
rg -n "struct ScalarFunction" --type rust -A 5 datafusion/

Repository: martin-augment/datafusion

Length of output: 1965


🏁 Script executed:

# Find the public helper functions in functions-nested
find datafusion/functions-nested/src -name "*.rs" -type f | head -20

Repository: martin-augment/datafusion

Length of output: 939


🏁 Script executed:

# Check expr_fn module for public helpers
rg -n "pub fn array_concat\|pub fn make_array" --type rust datafusion/functions-nested/

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Check what's in expr_fn.rs
cat datafusion/functions-nested/src/expr_fn.rs | head -100

Repository: martin-augment/datafusion

Length of output: 143


🏁 Script executed:

# Find ScalarFunction impl and its methods
rg -n "impl ScalarFunction" --type rust -A 30 datafusion/expr/src/expr.rs

Repository: martin-augment/datafusion

Length of output: 1287


🏁 Script executed:

# Check functions-nested lib.rs for public exports
cat datafusion/functions-nested/src/lib.rs | head -150

Repository: martin-augment/datafusion

Length of output: 5356


🏁 Script executed:

# Check the macros to understand make_udf_expr_and_func
rg -n "make_udf_expr_and_func" --type rust -A 20 datafusion/functions-nested/src/macros.rs | head -80

Repository: martin-augment/datafusion

Length of output: 1936


🏁 Script executed:

# Double-check: search for any usage of ScalarFunction::new with non-new_udf pattern
rg -n "ScalarFunction::new\(" --type rust datafusion/expr/src/expr.rs

Repository: martin-augment/datafusion

Length of output: 51


🏁 Script executed:

# Verify the old BuiltinScalarFunction is still in codebase but unused in new patterns
rg -n "enum BuiltinScalarFunction" --type rust datafusion/ -A 5

Repository: martin-augment/datafusion

Length of output: 51


This code will not compile—multiple critical API mismatches.

  1. Line 91: arg.get_type(&DataType::Null) has wrong signature. Expr::get_type() requires a &dyn ExprSchema parameter, not a &DataType.

  2. Lines 92-95, 102-105: ScalarFunction::new(BuiltinScalarFunction, args) does not exist. The only public constructor is ScalarFunction::new_udf(udf: Arc<ScalarUDF>, args).

  3. Lines 87, 93: BuiltinScalarFunction::ArrayConcat and BuiltinScalarFunction::MakeArray are not the correct types to use. Both functions are implemented as proper ScalarUDF instances in datafusion-functions-nested.

Fix: Use the public helper functions from datafusion_functions_nested::expr_fn:

  • Replace the entire to_array_concat implementation with calls to array_concat() and make_array() instead of manually constructing expressions with BuiltinScalarFunction and incorrect API calls.
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 85 to 106, the current
to_array_concat implementation uses incorrect APIs (Expr::get_type signature,
ScalarFunction::new constructor, and BuiltinScalarFunction variants) and won't
compile; replace the entire function body to use the public helpers from
datafusion_functions_nested::expr_fn: call make_array(expr) to wrap non-list
arguments and then call array_concat(vec_of_exprs) to build the concatenation
expression (import the two helper functions), removing any direct use of
BuiltinScalarFunction or ScalarFunction::new and any incorrect get_type calls.


impl ScalarUDFImpl for ConcatFunc {
Comment on lines +106 to 108
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Missing closing brace for impl ConcatFunc block.

The impl ConcatFunc block starting at line 67 is not closed before the impl ScalarUDFImpl block begins at line 108. A closing } is needed after the to_array_concat function.

         Ok(Expr::ScalarFunction(ScalarFunction::new(
             array_concat,
             args,
         )))
     }
+}
 
 impl ScalarUDFImpl for ConcatFunc {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
}
impl ScalarUDFImpl for ConcatFunc {
}
}
impl ScalarUDFImpl for ConcatFunc {
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 106 to 108, the impl
ConcatFunc block opened earlier (around line 67) is missing its closing brace;
add a single closing curly brace `}` immediately after the to_array_concat
function to terminate the impl ConcatFunc block before the subsequent impl
ScalarUDFImpl begins so the file compiles and the impl blocks are properly
separated.

fn as_any(&self) -> &dyn Any {
Expand All @@ -90,22 +120,40 @@ impl ScalarUDFImpl for ConcatFunc {

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;

// If any argument is an array, return the array type
if arg_types.iter().any(|t| matches!(t, List(_))) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array detection in return_type only checks List(_); it should likely also handle LargeList(_) and FixedSizeList(_) to return the correct array type. (Also applies to the if let Some(DataType::List(...)) below.)

🤖 Was this useful? React with 👍 or 👎

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The proposed changes check only for the DataType::List type but it should also check the other List types - LargeList and FixedSizeList, otherwise the solution won't be complete.

// Find the first array type to use as the base
if let Some(DataType::List(field)) = arg_types.iter().find(|t| matches!(t, List(_))) {
return Ok(DataType::List(field.clone()));
}
}

// Otherwise, use the existing string type logic
let mut dt = &Utf8;
arg_types.iter().for_each(|data_type| {
for data_type in arg_types {
if data_type == &Utf8View {
dt = data_type;
}
if data_type == &LargeUtf8 && dt != &Utf8View {
dt = data_type;
}
});

Ok(dt.to_owned())
}
Ok(dt.clone())
}
Comment on lines 121 to 143
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the concat.rs file more closely to understand the context
cd datafusion/functions/src/string && wc -l concat.rs

Repository: martin-augment/datafusion

Length of output: 82


🏁 Script executed:

# Check the full concat function implementation to understand the context
cd datafusion/functions/src/string && head -160 concat.rs | tail -60

Repository: martin-augment/datafusion

Length of output: 2014


🏁 Script executed:

# Search for tests related to array concatenation
find . -name "*.rs" -type f -exec grep -l "concat.*List\|array.*concat" {} \; | head -20

Repository: martin-augment/datafusion

Length of output: 726


🏁 Script executed:

# Look for type coercion logic in DataFusion
rg "coerce.*type|type.*coerce" --type rs -A 3 -B 3 | head -100

Repository: martin-augment/datafusion

Length of output: 95


🌐 Web query:

DataFusion concat array type coercion behavior

💡 Result:

Short answer: array_concat (and the array/list concat UDFs) require the input arrays to have element types that can be coerced to a single common element type; DataFusion will pick a common promoted/compatible Arrow DataType and (implicitly) cast elements as needed using DataFusion/Arrow cast rules. If types are not coercible the planner will error. Note also the SQL concat operator (string concat) is different — when given list/array arguments it may stringify them instead of performing array concatenation; use array_concat / list_concat for real array merge. [1][2][3][4]

Sources:

  • array/list function docs (array_concat signature and description). [1]
  • type-coercion / function-signature notes (arrays must be coercible to same inner type). [2]
  • bug/discussion showing concat vs array_concat behavior (arrays being stringified). [3]
  • Arrow/DataFusion cast behavior (how casts handle Utf8⇄numeric/boolean and list casts). [4]

🏁 Script executed:

# Check the proper array_concat implementation in functions-nested
cat datafusion/functions-nested/src/concat.rs | head -200

Repository: martin-augment/datafusion

Length of output: 5941


🏁 Script executed:

# Look for type coercion logic in the planner
rg "coerce_types_for_function|type_coerce" --type rs -A 5 | head -150

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Search for tests related to array concatenation
rg "array_concat|concat.*List" --type rs tests/ | head -30

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Check if there are integration tests for concat with arrays
find . -name "*.rs" -path "*/tests/*" -exec grep -l "array_concat\|concat.*\[" {} \; | head -10

Repository: martin-augment/datafusion

Length of output: 363


🏁 Script executed:

# Get the array_concat implementation in functions-nested
sed -n '200,400p' datafusion/functions-nested/src/concat.rs

Repository: martin-augment/datafusion

Length of output: 5908


🏁 Script executed:

# Search for ArrayConcat struct and return_type implementation
rg "struct ArrayConcat|impl.*ArrayConcat" --type rs -A 30 datafusion/functions-nested/src/concat.rs | head -80

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Get the return_type method from array_concat
rg "fn return_type" --type rs -B 5 -A 20 datafusion/functions-nested/src/concat.rs

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Look at how type coercion is done for arrays
rg "ListCoercion|coerce_types_for_function" --type rs -B 3 -A 10 | head -100

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Get the full string concat implementation
cat datafusion/functions/src/string/concat.rs | head -250

Repository: martin-augment/datafusion

Length of output: 9541


🏁 Script executed:

# Look at the to_array_concat method and how it's used
rg "to_array_concat|has_array_args" --type rs -B 5 -A 15 datafusion/functions/src/string/concat.rs

Repository: martin-augment/datafusion

Length of output: 95


🏁 Script executed:

# Check if there's type coercion happening in the planner for this
rg "concat.*coerce|string.*concat.*type" --type rs datafusion/ | head -20

Repository: martin-augment/datafusion

Length of output: 95


Type signature mismatch: return_type() doesn't match array_concat's actual type coercion behavior.

The return_type() method returns the first List field found without type unification. When delegating to array_concat() in invoke_with_args(), the actual return type may differ due to array_concat's proper type promotion (via type_union_resolution()). For example, concat(List<Int32>, List<Int64>) would declare return type List<Int32> but actually return List<Int64>, causing type signature mismatches. Use type_union_resolution() in return_type() to match array_concat's behavior, similar to the proper implementation in functions-nested.

🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 121 to 143, the
return_type() currently picks the first List field found which can disagree with
array_concat()'s actual promoted return type; instead collect the element
DataType of every List(_) argument, call type_union_resolution() (the same
promotion used by array_concat) on those element types to produce a unified
field, and return DataType::List(unified_field); if resolution fails propagate
the error. Keep the existing non-list string promotion logic unchanged for the
Utf8/LargeUtf8 path.


/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
/// concat('abcde', 2, NULL, 22) = 'abcde222'
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
// Check if any argument is an array
if Self::has_array_args(&args.args) {
// Convert to array_concat expression and evaluate it
let exprs: Vec<Expr> = args.args.into_iter().map(|a| a.into_expr()).collect();
let expr = Self::to_array_concat(exprs)?;
return expr.eval(args.context);
}

// Original string concatenation logic
let ScalarFunctionArgs { args, .. } = args;
Comment on lines 147 to 157
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fundamental design issue: Cannot convert ColumnarValue to Expr for evaluation.

This approach has critical flaws:

  1. Line 151: ColumnarValue::into_expr() does not exist. ColumnarValue represents runtime columnar data (arrays/scalars), while Expr represents logical expressions. They cannot be directly converted.

  2. Line 153: Expr::eval(args.context) is not a valid API. Expressions are evaluated through the execution engine, not directly.

  3. Conceptual issue: invoke_with_args receives already-evaluated values. The correct approach is to call the array concatenation logic directly on the ColumnarValue arguments, not attempt to rebuild and re-evaluate expressions.

Consider instead:

  • Import and call the array_concat kernel/function directly on the ColumnarValue arguments
  • Or delegate to the existing ArrayConcat UDF's invoke method
#!/bin/bash
# Find how ArrayConcat is implemented for direct invocation
rg -n "impl.*ScalarUDFImpl.*for.*ArrayConcat" --type rust -A 20

# Find the array_concat kernel or function
rg -n "fn array_concat" --type rust -C5
🤖 Prompt for AI Agents
In datafusion/functions/src/string/concat.rs around lines 147 to 157, do not try
to turn ColumnarValue into Expr or call Expr::eval; instead detect array
arguments at runtime and call the array-concatenation implementation directly on
the ColumnarValue inputs (or delegate to the existing ArrayConcat UDF
implementation). Concretely: remove the into_expr() map and
to_array_concat()/eval flow, match the incoming ScalarFunctionArgs.args for
ColumnarValue::Array/Scalar variants, call the array_concat kernel or
ArrayConcat.invoke_with_args (importing the module if needed) with the same
context/args, and return its ColumnarValue result; ensure errors are propagated
and types are validated before calling the kernel.


let mut return_datatype = DataType::Utf8;
Expand Down